Title: Analysis of King County Streams Water Quality Index¶

Author: MG¶

Date: Oct 19, 2023¶


Description¶

This document follows rep_py.Rmd, which prepared the .csv file that serves as the base for this research. The document presents an analysis of the Water Quality Index (WQI) in King County, WA, for the period 1970-2023.

The analysis answers two questions raised in rep_py.Rmd and provides additional insights and tools to perform exploratory data analysis.

  1. Q1: How has WQI changed over the years? Was WQI better in the past?

  2. Q2: How does WQI vary with the locators' geography? Is there a pattern, such as more densely populated areas having worse WQI than sparsely populated ones? Does any area (north/south/east/west) have cleaner water than the others?

The answers to those questions can be found in the Summary section.

In [ ]:
# To export this notebook externally and hide the code cells in the exported file, run:
# jupyter nbconvert --to html --no-input file.ipynb
import plotly.io as pio
pio.renderers.default='notebook'

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# for maps
import plotly.express as px
import plotly.graph_objects as go 
from plotly.subplots import make_subplots

# Instead of setting the cell to Markdown, create Markdown from within a code cell!
# We can just use python variable replacement syntax to make the text dynamic
from IPython.display import Markdown as md


import myutils as ut

# set rows and columns to show all of it
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# setup for seaborn plots
sns.set_style('darkgrid')
sns.set_palette('Set2')

# the style of the map
MAP_STYLE = "open-street-map"
#MAP_STYLE = "stamen-terrain"


# center of the map
MAP_CENTER_LAT = 47.5 
MAP_CENTER_LNG = -122.249
In [ ]:
# an attempt to change the font size used by IPython.display.Markdown so
# that the fonts in md() match the cell fonts.
# So far it does not work.
from IPython.core.display import HTML
HTML("""
<style>

div.cell { /* Tunes the space between cells */
margin-top:1em;
margin-bottom:1em;
}

div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:1.4em;
text-align:center;
}

div.text_cell_render h2 { /*  Parts names nearer from text */
margin-bottom: -0.4em;
}


div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.5em;
line-height:1.4em;
padding-left:3em;
padding-right:3em;
}
</style>
""")

#%%html
#<style type='text/css'>
#.CodeMirror{
#font-size: 20px;
#}
#</style>
Out[ ]:

Main Dataframe¶

This is the dataframe after processing that is our base data for analysis.

In [ ]:
# there was a warning about mixed types in column=3 so I set the low_memory to False
# my processed data
df = pd.read_csv(ut.PATH_DATA_PROCESSED, low_memory = False)

# for keeping our max and min year of the dataframe
YEAR_MIN = min(df["WaterYear"])
YEAR_MAX = max(df["WaterYear"])

# summary of data frame
df.describe(include = "all")
Out[ ]:
Locator WaterYear WQI Month ParameterGroup lng lat MostRecentSample SiteName StreamName WQI_binned
count 319326 319326.000000 319326.000000 319326.000000 319326 319326.000000 319326.000000 319326 319326 319326 319326.000000
unique 78 NaN NaN NaN 13 NaN NaN 2 78 60 NaN
top 3106 NaN NaN NaN Temperature NaN NaN False Green River at Starfire Way Green NaN
freq 6981 NaN NaN NaN 28819 NaN NaN 318426 6981 25473 NaN
mean NaN 2003.907562 80.837086 6.983127 NaN -122.153147 47.566153 NaN NaN NaN 2.602231
std NaN 13.302435 25.510223 3.751370 NaN 0.143708 0.159583 NaN NaN NaN 0.655062
min NaN 1970.000000 0.000000 1.000000 NaN -122.508600 47.176100 NaN NaN NaN 1.000000
25% NaN 1994.000000 75.370000 4.000000 NaN -122.233900 47.465500 NaN NaN NaN 2.000000
50% NaN 2005.000000 90.820000 7.000000 NaN -122.166400 47.601100 NaN NaN NaN 3.000000
75% NaN 2016.000000 98.310000 10.000000 NaN -122.069600 47.705400 NaN NaN NaN 3.000000
max NaN 2023.000000 100.000000 13.000000 NaN -121.372200 47.811400 NaN NaN NaN 3.000000
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319326 entries, 0 to 319325
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Locator           319326 non-null  object 
 1   WaterYear         319326 non-null  int64  
 2   WQI               319326 non-null  float64
 3   Month             319326 non-null  int64  
 4   ParameterGroup    319326 non-null  object 
 5   lng               319326 non-null  float64
 6   lat               319326 non-null  float64
 7   MostRecentSample  319326 non-null  bool   
 8   SiteName          319326 non-null  object 
 9   StreamName        319326 non-null  object 
 10  WQI_binned        319326 non-null  int64  
dtypes: bool(1), float64(3), int64(3), object(4)
memory usage: 24.7+ MB

The first few rows of the dataframe look like this:

In [ ]:
df.head()
Out[ ]:
Locator WaterYear WQI Month ParameterGroup lng lat MostRecentSample SiteName StreamName WQI_binned
0 311 1970 70.93 13 AnnualScore -122.2479 47.4655 False Green River at Interurban Green 2
1 311 1971 61.14 13 AnnualScore -122.2479 47.4655 False Green River at Interurban Green 2
2 311 1972 74.90 13 AnnualScore -122.2479 47.4655 False Green River at Interurban Green 2
3 311 1973 75.38 13 AnnualScore -122.2479 47.4655 False Green River at Interurban Green 2
4 311 1974 83.90 13 AnnualScore -122.2479 47.4655 False Green River at Interurban Green 3

Stations Locations¶

In [ ]:
# create a station location df
loc_df = df[['Locator', 'SiteName', 'lng', 'lat']]
loc_df = loc_df.drop_duplicates()
loc_df.head()
Out[ ]:
Locator SiteName lng lat
0 311 Green River at Interurban -122.2479 47.4655
49 317 Springbrook Creek mouth at SW 16th St -122.2326 47.4659
97 321 Crisp Creek mouth at SE Green Valley Rd -122.0671 47.2884
133 322 Newaukum Creek mouth at 212th Way SE -122.0562 47.2741
182 430 Lyon Creek mouth at Bothell Way NE -122.2778 47.7529
In [ ]:
loc_count = len(loc_df.index)
loc_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 78 entries, 0 to 2468
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Locator   78 non-null     object 
 1   SiteName  78 non-null     object 
 2   lng       78 non-null     float64
 3   lat       78 non-null     float64
dtypes: float64(2), object(2)
memory usage: 3.0+ KB
In [ ]:
md(f"Records show that there have been {loc_count} stations reporting WQI during the period {YEAR_MIN}-{YEAR_MAX}.")
Out[ ]:

Records show that there have been 78 stations reporting WQI during the period 1970-2023.

Stations locations are presented on the map below.

In [ ]:
fig = px.density_mapbox(loc_df, lat = "lat", lon = "lng", radius = 10, 
center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, hover_name = "SiteName",
hover_data = 'Locator',
mapbox_style = MAP_STYLE,
range_color = [0, 1],
title = f"Locations of stations in King County, WA, {YEAR_MIN}-{YEAR_MAX}; {loc_count} stations.",
width = 800,
height = 900)
fig.show()
In [ ]:
# create a station location df wrt year
loc_year_df = df[['Locator', 'SiteName', 'lng', 'lat', 'WaterYear']]
loc_year_df = loc_year_df.drop_duplicates()
loc_year_df.head()
Out[ ]:
Locator SiteName lng lat WaterYear
0 311 Green River at Interurban -122.2479 47.4655 1970
1 311 Green River at Interurban -122.2479 47.4655 1971
2 311 Green River at Interurban -122.2479 47.4655 1972
3 311 Green River at Interurban -122.2479 47.4655 1973
4 311 Green River at Interurban -122.2479 47.4655 1974

A few questions come to mind regarding descriptive statistics: how many stations were operating each year? Was the growth steady? Were there any drops in the number of stations serving the region over the years?
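
Counting the distinct stations per year is a one-liner with `groupby`. Below is a minimal sketch on a toy frame; the real call would be on `df`, which has the same column names:

```python
import pandas as pd

# Toy frame standing in for the real df: (Locator, WaterYear) pairs,
# possibly repeated across months/parameter groups.
toy = pd.DataFrame({
    "Locator":   ["311", "311", "KTHA01", "311", "KTHA01", "3106"],
    "WaterYear": [1970,  1970,  1970,     1971,  1971,     1971],
})

# Number of distinct stations reporting in each year
stations_per_year = toy.groupby("WaterYear")["Locator"].nunique()
print(stations_per_year)
# 1970 -> 2 stations, 1971 -> 3 stations
```

The later sections arrive at the same per-year counts via a pivot table; this is just the direct long-format route.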

In [ ]:
# loc_year_df.loc[loc_year_df["WaterYear"] == 2023]
In [ ]:
# if you do not sort it Dash might present the 'animation_frame' out of order
loc_year_df = loc_year_df.sort_values('WaterYear', ascending=True)
#loc_year_df.head()
In [ ]:
from dash import Dash, html, dash_table, dcc, callback, Input, Output

app = Dash(__name__)

def display_animated_graph(my_df):
    fig = px.density_mapbox(my_df, 
       lat = "lat", lon = "lng", radius = 10, 
       center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, hover_name = "SiteName",
       hover_data = "Locator",
       mapbox_style = MAP_STYLE,
       title = f"Locations of stations for a selected year in King County, WA, over the years {YEAR_MIN}-{YEAR_MAX}.",
       animation_frame = "WaterYear")

    return fig

app.layout = html.Div([
    html.H4(children = ["Available stations in King County, WA, over years ", YEAR_MIN, "-", YEAR_MAX]),
    html.Div(children=["WQI in King County, WA, ", YEAR_MIN, "-", YEAR_MAX]),
    #dash_table.DataTable(data=loc_df.to_dict('records'), page_size=10),
    html.Hr(),
    
    # 'style'  will change the graph to be x% of the viewport height of the browser.
    # more can be found: https://css-tricks.com/fun-viewport-units/
    dcc.Graph(figure = display_animated_graph(loc_year_df), 
              id = "anim_graph",
              # style={'width': '90vw', 'height': '95vh'}
              style={'width': '900px', 'height': '800px'}
              )
    #dcc.Loading(dcc.Graph(id = "graph"), figure = display_animated_graph(loc_year_df), type = "cube"),
#    dcc.RadioItems(options=["2023", "2022"], value="2023", id="year_id"),
#    dcc.Graph(figure={}, id="controls_and_graph")
])

if __name__ == '__main__':
    app.run(debug=True)
In [ ]:
# get only AnnualScore and reshape the dataframe, this df will be used
# and showed later in this document.
df_an = df[df["ParameterGroup"] == "AnnualScore"]
#loc = df.Locator.unique().tolist()
# reshape the dataframe
#pd.melt(df_annual, id_vars="Locator", value_vars=loc, var_name = "Location", value_name='WQI')
df_an = df_an.pivot_table(index="WaterYear", columns=['Locator'], values=['WQI'])

# change the names of the columns because they are in the form of (WQI, '311')
locators = [x[1] for x in df_an.columns]
df_an.columns = locators
In [ ]:
# count the non-null values in the rows
no_stations_yearly = pd.DataFrame(df_an.count(axis=1), columns = ["StationsCount"])
#len(no_stations_yearly.columns)
# some basic descriptive stats
min_no_stations = no_stations_yearly["StationsCount"].min()
max_no_stations = no_stations_yearly["StationsCount"].max()

def get_no_stations(datafr, cond):
    # boolean-mask rows whose StationsCount equals cond
    return datafr[datafr["StationsCount"] == cond]

years_for_min = get_no_stations(no_stations_yearly, min_no_stations)
years_for_max = get_no_stations(no_stations_yearly, max_no_stations)
In [ ]:
""" Plots the years_for_min and *_max
@in dat_fr : frame to be shown
"""
def plot_years(dat_fr):
    print(dat_fr)

    #fig = go.Figure(data=[go.Table(
    #    header=dict(values=['Year', 'No. of Stations']),
    #    cells=dict(values=[dat_fr.index, 
    #                       dat_fr["StationsCount"]]))
    #])
    #fig.update_layout(width=400, height = 300)
    #fig.show()
In [ ]:
md(f"<h5>First stations that started reporting WQI for the King County were:</h5>")
Out[ ]:
First stations that started reporting WQI for the King County were:
In [ ]:
oldest_stations = loc_year_df.loc[loc_year_df["WaterYear"] == 1970]
print(oldest_stations[['WaterYear', 'Locator', 'SiteName']].to_string(index = False))
 WaterYear Locator                  SiteName
      1970     311 Green River at Interurban
      1970  KTHA01             Piper's Creek
In [ ]:
curr_count = no_stations_yearly.loc[2023, "StationsCount"]
md(f"<h5>Currently we have {curr_count} stations reporting. "
   f"The service started in 1970 with two stations and it was the least number"
    f" of stations so far:"
   f"</h5>")
Out[ ]:
Currently we have 75 stations reporting. The service started in 1970 with two stations and it was the least number of stations so far:
In [ ]:
plot_years(years_for_min)
           StationsCount
WaterYear               
1970                   2

The maximum number of stations for the given period was:

In [ ]:
plot_years(years_for_max)
           StationsCount
WaterYear               
2021                  76

Let's see how the number of stations measuring WQI changed over the years.

In [ ]:
def plot_no_stations_yearly(df_):
    f = px.line(df_, x=df_.index, y="StationsCount", markers=True,
              title=f"The number of stations measuring WQI in King County, WA, over years {YEAR_MIN}-{YEAR_MAX}.",
              labels = {"WaterYear" : "Year", "StationsCount" : "No. of Stations"})
    #f.update_traces(mode="markers+lines", hovertemplate=None)
    #f.update_layout(hovermode="x")
    f.update_xaxes(dtick=2)
    f.update_yaxes(dtick=10)
    return f

plot_no_stations_yearly(no_stations_yearly).show()

In general, the number of stations has been growing, although certain years clearly saw station closures and/or resumptions of service. For instance, drops in the total number of stations serving the region appeared in the following years:

  • 1973; 11 $\downarrow$: 14 stations compared to 25 stations in 1972,
  • 1978; 13 $\downarrow$: 18 stations compared to 31 stations in 1977, and
  • 2010; 28 $\downarrow$: 32 stations compared to 60 stations in 2009.

There were also sharp increases in the number of operating stations. The significant increases in the number of stations serving the region appeared in the following years:

  • 1972; 22 $\uparrow$: 25 stations compared to 3 in the previous year
  • 1977; 18 $\uparrow$: 31 stations compared to 13 in 1975
  • 1979; 21 $\uparrow$: 39 stations compared to 18 in the previous year
  • 2011; 13 $\uparrow$: 45 stations compared to 32 in the previous year
  • 2015; 30 $\uparrow$: 74 stations compared to 44 in 2012
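
The drops and jumps listed above can be located programmatically with a year-over-year difference. A minimal sketch follows; `counts` stands in for `no_stations_yearly["StationsCount"]`, using a few of the values quoted above:

```python
import pandas as pd

# Stand-in for no_stations_yearly["StationsCount"]: stations per year
counts = pd.Series(
    {1971: 3, 1972: 25, 1973: 14, 1977: 31, 1978: 18, 1979: 39},
    name="StationsCount",
)

# Change relative to the previous recorded year
delta = counts.diff()

# Years with a net loss / gain of stations
drops = delta[delta < 0]
jumps = delta[delta > 0]
print(drops)  # 1973: -11, 1978: -13
print(jumps)  # 1972: +22, 1977: +17 (vs 1973, the previous recorded year), 1979: +21
```

Note that `diff()` compares against the previous *recorded* year, so gaps in the index (1974-1976 here) are silently bridged; for the full yearly series this caveat does not apply.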

If we study the animated map in the Stations Locations section, we can see that, in general, the number of stations has been growing, although there are interruptions and resumptions in stations' service. The figure shows that in 1972, stations were added to the core area. In 1973, stations in the middle of the area were closed, and in 1978, the northern stations were closed. However, in 1979, the stations closed in 1978 were restored. In 2007, six reporting entities were added on Vashon Island. Next, there were significant closures on the eastern frontier in 2010. However, in 2011, there was a substantial expansion including the deep east, North Bend, Enumclaw, and the far east at South Fork Skykomish at Hwy 2.

Finally, in 2015, more stations were added to the eastern and northern parts of the area.

This short analysis leads to a conclusion that some stations have been serving for a long period of time, while some have a shorter tenure.
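
That tenure observation can be quantified per station. Below is a minimal sketch on a toy frame, assuming the long frame's `Locator` and `WaterYear` columns; the column names `years`, `first`, and `last` are my own, chosen for illustration:

```python
import pandas as pd

# Toy long frame: one row per (station, reporting year)
toy = pd.DataFrame({
    "Locator":   ["311", "311", "311", "KTHA01", "0484A"],
    "WaterYear": [1970,  1971,  1972,  1970,     2005],
})

# Distinct reporting years per station, plus first/last year of service
tenure = toy.groupby("Locator")["WaterYear"].agg(
    years="nunique", first="min", last="max"
)
print(tenure.sort_values("years", ascending=False))
```

Sorting by `years` immediately separates long-serving stations from short-tenured ones, which is what the next section's record counts formalize.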

In order to analyze WQI over the years, I have prepared a more appropriate dataframe that focuses on the annual WQI. The following section provides some info about this slice.

Dataframe Used For WQI Analysis¶

Below, the basic info about the dataframe is presented.

In [ ]:
df_an.info()
<class 'pandas.core.frame.DataFrame'>
Index: 54 entries, 1970 to 2023
Data columns (total 78 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   0450CC         16 non-null     float64
 1   0484A          3 non-null      float64
 2   3106           53 non-null     float64
 3   311            49 non-null     float64
 4   317            48 non-null     float64
 5   321            36 non-null     float64
 6   322            49 non-null     float64
 7   430            51 non-null     float64
 8   434            50 non-null     float64
 9   438            30 non-null     float64
 10  440            48 non-null     float64
 11  442            45 non-null     float64
 12  444            44 non-null     float64
 13  446            48 non-null     float64
 14  470            51 non-null     float64
 15  474            51 non-null     float64
 16  478            50 non-null     float64
 17  484            47 non-null     float64
 18  486            48 non-null     float64
 19  631            48 non-null     float64
 20  632            46 non-null     float64
 21  A315           47 non-null     float64
 22  A319           45 non-null     float64
 23  A320           46 non-null     float64
 24  A432           44 non-null     float64
 25  A438           44 non-null     float64
 26  A456           42 non-null     float64
 27  A499           42 non-null     float64
 28  A617           26 non-null     float64
 29  A620           29 non-null     float64
 30  A631           45 non-null     float64
 31  A670           11 non-null     float64
 32  A680           31 non-null     float64
 33  A685           25 non-null     float64
 34  A687           5 non-null      float64
 35  A690           26 non-null     float64
 36  AMES_1         15 non-null     float64
 37  B319           50 non-null     float64
 38  B484           46 non-null     float64
 39  B499           7 non-null      float64
 40  BB470          20 non-null     float64
 41  BSE_1MUDMTNRD  13 non-null     float64
 42  C320           46 non-null     float64
 43  C370           36 non-null     float64
 44  C446           33 non-null     float64
 45  C484           45 non-null     float64
 46  CHERRY_1       16 non-null     float64
 47  D320           48 non-null     float64
 48  D474           34 non-null     float64
 49  F321           29 non-null     float64
 50  G320           46 non-null     float64
 51  GRIFFIN        13 non-null     float64
 52  HARRIS_1       15 non-null     float64
 53  J484           36 non-null     float64
 54  KSHZ06         37 non-null     float64
 55  KTHA01         54 non-null     float64
 56  KTHA02         32 non-null     float64
 57  KTHA03         34 non-null     float64
 58  LSIN1          23 non-null     float64
 59  LSIN9          22 non-null     float64
 60  MFK_SNQ        13 non-null     float64
 61  N484           45 non-null     float64
 62  NFK_SNQ        13 non-null     float64
 63  PATTER_3       15 non-null     float64
 64  RAGING_MTH     13 non-null     float64
 65  S478           18 non-null     float64
 66  S484           38 non-null     float64
 67  SFK_SNQ        13 non-null     float64
 68  SKYKOMISH      13 non-null     float64
 69  SNQDUVALL      13 non-null     float64
 70  TOLT_MTH       13 non-null     float64
 71  VA12A          16 non-null     float64
 72  VA37A          12 non-null     float64
 73  VA41A          16 non-null     float64
 74  VA42A          17 non-null     float64
 75  VA45A          16 non-null     float64
 76  VA65A          15 non-null     float64
 77  X630           24 non-null     float64
dtypes: float64(78)
memory usage: 35.4 KB

This is how a few rows of the sliced dataframe look:

In [ ]:
df_an.head()
Out[ ]:
0450CC 0484A 3106 311 317 321 322 430 434 438 440 442 444 446 470 474 478 484 486 631 632 A315 A319 A320 A432 A438 A456 A499 A617 A620 A631 A670 A680 A685 A687 A690 AMES_1 B319 B484 B499 BB470 BSE_1MUDMTNRD C320 C370 C446 C484 CHERRY_1 D320 D474 F321 G320 GRIFFIN HARRIS_1 J484 KSHZ06 KTHA01 KTHA02 KTHA03 LSIN1 LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 RAGING_MTH S478 S484 SFK_SNQ SKYKOMISH SNQDUVALL TOLT_MTH VA12A VA37A VA41A VA42A VA45A VA65A X630
WaterYear
1970 NaN NaN NaN 70.93 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37.89 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1971 NaN NaN 48.31 61.14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.36 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1972 NaN NaN 67.97 74.90 NaN 83.43 77.38 80.32 84.46 88.33 89.09 95.55 NaN NaN 86.95 87.80 91.22 90.80 NaN 89.5 88.82 NaN 79.75 89.12 NaN 93.45 NaN NaN NaN NaN 83.5 NaN NaN NaN NaN NaN NaN 71.03 87.38 NaN NaN NaN 92.2 NaN NaN NaN NaN 91.42 NaN NaN 84.40 NaN NaN NaN NaN 47.24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1973 NaN NaN 67.19 75.38 92.03 84.14 60.90 61.22 NaN NaN NaN NaN NaN NaN 66.29 55.99 62.88 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 85.98 54.10 NaN NaN NaN NaN NaN NaN NaN NaN NaN 49.08 NaN 67.46 NaN NaN NaN NaN 51.35 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1974 NaN NaN 87.43 83.90 NaN NaN NaN 37.59 41.80 NaN NaN NaN NaN NaN 39.42 30.67 38.24 36.28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 51.15 NaN NaN 32.02 NaN NaN NaN NaN 63.2 NaN 63.21 NaN NaN NaN NaN NaN 64.63 NaN NaN NaN 54.61 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The descriptive stats are presented as follows:

In [ ]:
#%%capture
df_an.describe()
Out[ ]:
0450CC 0484A 3106 311 317 321 322 430 434 438 440 442 444 446 470 474 478 484 486 631 632 A315 A319 A320 A432 A438 A456 A499 A617 A620 A631 A670 A680 A685 A687 A690 AMES_1 B319 B484 B499 BB470 BSE_1MUDMTNRD C320 C370 C446 C484 CHERRY_1 D320 D474 F321 G320 GRIFFIN HARRIS_1 J484 KSHZ06 KTHA01 KTHA02 KTHA03 LSIN1 LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 RAGING_MTH S478 S484 SFK_SNQ SKYKOMISH SNQDUVALL TOLT_MTH VA12A VA37A VA41A VA42A VA45A VA65A X630
count 16.000000 3.000000 53.000000 49.000000 48.000000 36.000000 49.000000 51.000000 50.000000 30.000000 48.000000 45.000000 44.000000 48.000000 51.000000 51.000000 50.000000 47.000000 48.000000 48.000000 46.000000 47.000000 45.000000 46.000000 44.000000 44.000000 42.000000 42.000000 26.000000 29.000000 45.000000 11.000000 31.000000 25.000000 5.000000 26.000000 15.000000 50.000000 46.000000 7.000000 20.000000 13.000000 46.000000 36.000000 33.000000 45.000000 16.000000 48.000000 34.000000 29.000000 46.000000 13.000000 15.000000 36.000000 37.000000 54.000000 32.000000 34.000000 23.000000 22.000000 13.00000 45.000000 13.000000 15.000000 13.000000 18.000000 38.000000 13.000000 13.000000 13.000000 13.000000 16.000000 12.000000 16.000000 17.000000 16.000000 15.000000 24.000000
mean 50.641250 76.490000 55.008679 59.684490 22.909583 69.857222 40.253265 49.911765 39.699600 81.270000 66.259792 60.903111 45.731818 47.083542 55.338235 44.491373 56.934600 50.723404 60.416667 63.294792 58.173261 37.267660 75.853778 75.323478 51.456591 83.093182 44.704048 62.548333 62.356154 62.341034 74.443556 72.128182 68.190000 79.306000 77.428000 78.972308 57.828667 82.786800 56.087826 68.325714 75.032000 68.303077 82.536739 50.669167 47.975758 63.139333 87.335000 85.342292 62.207941 87.160690 64.294130 89.433846 88.794000 70.484444 44.191892 54.216111 49.646875 63.874118 63.097826 94.545909 80.21000 68.792222 85.116923 68.924667 76.843077 75.176111 54.968684 73.287692 91.270769 78.295385 87.427692 78.340000 68.805833 58.378125 64.232353 49.475000 68.108000 45.736250
std 13.520557 9.666628 13.229444 11.770567 20.363709 12.253840 16.985961 17.469803 15.777013 9.045613 14.695290 19.475125 12.606108 15.870263 13.683539 13.752416 15.392678 17.705161 9.471905 13.445372 14.165452 24.838234 10.250498 13.623310 16.530379 9.051694 16.759309 14.498191 14.110325 10.932884 9.707563 10.976939 10.587843 10.677582 7.327446 12.278895 12.122151 9.150209 15.077185 12.241487 7.825058 11.498756 6.153182 19.603376 16.543382 11.398525 4.870128 4.846396 15.130157 12.287005 11.933127 3.674873 4.149175 10.840733 14.589828 15.451880 17.142440 14.007483 33.614609 5.820144 7.87998 9.137460 4.876263 9.090548 11.399926 9.759039 14.072949 11.829651 3.719611 10.628369 6.268454 11.019119 18.001418 17.678572 15.580920 20.147753 13.913445 14.073361
min 32.790000 69.780000 26.740000 33.840000 1.000000 46.850000 4.490000 13.000000 12.180000 57.280000 27.990000 19.290000 21.330000 8.150000 28.030000 13.420000 20.690000 7.820000 40.150000 33.770000 32.250000 1.000000 51.850000 27.190000 15.380000 57.690000 13.420000 38.070000 27.880000 36.750000 46.800000 47.630000 39.320000 43.190000 67.480000 40.590000 43.030000 58.920000 22.780000 49.720000 62.310000 51.060000 68.450000 11.070000 14.470000 41.330000 79.360000 73.350000 27.270000 57.220000 43.090000 84.170000 80.940000 33.930000 16.520000 23.360000 6.830000 18.870000 7.130000 83.470000 67.49000 49.330000 74.850000 53.480000 58.930000 54.610000 28.450000 57.460000 84.260000 54.180000 75.650000 50.540000 29.740000 29.380000 29.710000 19.780000 38.810000 19.470000
25% 41.137500 70.950000 43.740000 50.600000 6.310000 60.095000 30.350000 36.700000 27.492500 76.672500 58.500000 46.010000 37.280000 34.177500 46.595000 35.745000 45.690000 36.820000 55.705000 53.997500 49.152500 14.525000 70.480000 68.132500 40.157500 77.087500 32.510000 49.950000 53.625000 57.910000 69.080000 66.225000 61.760000 74.390000 72.670000 76.347500 50.810000 78.992500 48.602500 64.445000 72.032500 59.330000 78.202500 38.867500 35.330000 55.700000 83.872500 81.990000 56.322500 76.170000 55.662500 86.000000 85.920000 66.787500 32.370000 42.482500 38.767500 58.557500 25.410000 91.042500 77.62000 64.060000 82.790000 63.720000 67.260000 67.172500 43.875000 62.200000 88.690000 75.770000 84.670000 73.042500 58.220000 41.060000 54.240000 34.512500 56.985000 38.172500
50% 48.705000 72.120000 55.100000 61.140000 20.260000 72.010000 39.330000 47.200000 38.820000 83.415000 68.040000 61.210000 43.490000 47.735000 57.450000 45.720000 58.675000 53.660000 61.290000 64.695000 58.050000 37.420000 79.320000 78.475000 50.935000 84.990000 43.955000 62.845000 64.165000 63.570000 77.020000 75.000000 68.420000 77.440000 79.960000 81.490000 52.880000 85.250000 57.315000 67.060000 73.875000 70.070000 83.760000 48.840000 49.130000 63.840000 87.075000 84.845000 63.045000 93.280000 65.945000 89.930000 88.320000 71.970000 44.910000 57.835000 53.775000 63.745000 80.830000 97.840000 80.06000 70.240000 85.720000 66.570000 79.960000 77.810000 52.860000 73.480000 92.050000 78.810000 87.730000 81.165000 69.685000 62.675000 68.790000 48.225000 72.290000 44.610000
75% 57.322500 79.845000 63.990000 65.760000 32.542500 79.810000 50.030000 63.740000 48.965000 87.557500 76.645000 75.580000 54.110000 59.687500 65.230000 53.180000 69.287500 61.020000 64.887500 72.535000 66.837500 51.930000 82.930000 85.367500 64.272500 90.550000 55.417500 74.222500 70.435000 69.290000 80.930000 79.550000 75.305000 87.090000 81.000000 85.945000 63.305000 89.480000 66.380000 71.305000 78.317500 74.760000 86.367500 62.347500 59.230000 68.780000 90.330000 88.922500 67.850000 95.440000 71.985000 92.460000 92.040000 76.367500 52.880000 65.787500 61.037500 72.295000 90.955000 99.692500 84.68000 74.370000 86.650000 75.875000 82.870000 83.325000 67.060000 83.360000 93.010000 87.250000 91.500000 86.235000 77.590000 72.522500 75.730000 66.945000 78.220000 53.650000
max 85.460000 87.570000 87.430000 86.580000 92.030000 94.050000 80.710000 88.720000 84.460000 96.160000 91.890000 95.550000 75.480000 81.370000 89.830000 87.800000 91.220000 90.800000 86.740000 89.500000 88.820000 96.140000 93.330000 92.960000 87.630000 97.480000 86.690000 88.500000 93.200000 84.490000 92.930000 86.260000 86.220000 94.240000 86.030000 94.160000 82.390000 94.740000 87.380000 90.000000 93.230000 94.120000 94.930000 90.540000 72.990000 86.160000 97.660000 95.110000 92.840000 98.690000 95.150000 96.150000 94.850000 89.570000 79.610000 82.780000 82.180000 85.210000 99.710000 99.930000 93.25000 90.430000 94.710000 86.320000 96.480000 87.280000 80.000000 92.030000 98.040000 90.310000 97.490000 92.040000 95.150000 79.200000 86.820000 86.120000 89.520000 77.700000

Now I will continue with the analysis of WQI.

WQI Over The Years 1970-2023¶

According to [1], WQI (Water Quality Index) stratifies water quality by the degree of "concern" into three groups, where lower WQI values indicate higher concern about improving water quality in a given area:

  • high concern for WQI $<$ 40 (not clean water)
  • moderate concern for 40 $\le$ WQI $<$ 80
  • low concern for WQI $\ge$ 80 (clean water)
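
These three bands can be encoded as an ordered categorical with `pd.cut`; a minimal sketch, where the short labels are mine, mirroring the groups above:

```python
import pandas as pd

wqi = pd.Series([12.5, 40.0, 79.9, 80.0, 97.3])

# Left-closed bins: [0, 40) high, [40, 80) moderate, [80, inf) low concern
concern = pd.cut(
    wqi,
    bins=[0, 40, 80, float("inf")],
    labels=["high", "moderate", "low"],
    right=False,  # make intervals left-closed, right-open
)
print(concern.tolist())
# ['high', 'moderate', 'moderate', 'low', 'low']
```

The `WQI_binned` column in the main dataframe presumably encodes the same three-way split as integers 1-3.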
In [ ]:
md(f"We can start off this analysis by plotting the value for WQI for all "
   f"available years, i.e., {YEAR_MIN}-{YEAR_MAX} from all available stations."
   f" I added the lines that stratify WQI values.")
Out[ ]:

We can start off this analysis by plotting the value for WQI for all available years, i.e., 1970-2023 from all available stations. I added the lines that stratify WQI values.

In [ ]:
"""
    Plot WQI lines in plotly
    @param fig_ : (in/out) figure to be modified by adding two horizontal lines
"""
def add_WQI_lines(fig_):
    # high concern line
    fig_.add_hline(y=40, line_dash="dash", line_color="red")
                  #annotation_position="top", annotation_text="High concern" )
    fig_.add_hline(y=80, line_dash="dash", line_color="green")
                  #annotation_position = "bottom", annotation_text="Low concern")
    fig_.update_xaxes(dtick=2)
    fig_.update_yaxes(dtick=10)
    return fig_
    
# -----------------------
# plot with lines
# -----------------------
def plot_hlines():
    plt.axhline(y=40, color='r', linestyle='dashed')
    plt.axhline(y=80, color='b', linestyle='dashed')

"""Create dataframe describing stats of the columns: min, max, and median
    @param dat_fr: in 
"""
def crt_stat_df(dat_fr):
    return pd.DataFrame(
        {'min' : dat_fr.min(axis=0),
          'max' : dat_fr.max(axis=0),
          'median' : dat_fr.median(axis=0)}
    ).sort_values('median', ascending=False)
In [ ]:
"""
In case you have more lines than the colors present in the colormap, this is
how you can construct a custom colorscale so that you get one complete sequence
instead of a cycling sequence:
    @param base_color_seq (in) The base color sequence we are interested in
    @param colors_count The final number of colors that you want to have in
                        this color scale
    @return The custom color scale, e.g., px.colors.sequential.RdBu
"""
def get_custom_colorscale(base_color_seq, colors_count):
    
    return px.colors.sample_colorscale(
        colorscale = base_color_seq,
        samplepoints=colors_count,
        low=0.0,
        high=1.0,
        colortype="rgb")
""" 
    Prepare the figure for all WQI in a dataframe and all stations
    @param df_ The dataframe that contains all that WQI information
    @return f A plotly figure ready to be shown()
"""
def yearly_WQI_all_fig(df_):
    
    #df_an.plot(title = "AnnualScore WQI for all locators in King County.\n"
    #       "According to [1] stations with WQI < 40 are of 'high concern',\n"
    #       "[40; 80) are of 'moderate concern', >= 80 'low concern.'",
    #       legend=False, 
    #       ylabel= "WQI")
    #plot_hlines()
    #plt.show()

    # sort the columns according to median in a descending order
    # get the median of the columns and sort the values so
    # and then start from the highest median to the lowest
    sorted_cols = (df_.median(axis = 0)).sort_values(ascending=False).index

    f = px.line(df_, x = df_.index, y = sorted_cols, 
                labels = {"value" : "WQI",
                          "variable" : "Station ID"},
                title = "AnnualScore WQI for all locators in Kings County, WA,"
                f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
                "descending order according to stations' WQI median. "
                f"{len(df_.columns)} stations.",
                height = 900,
                # get the reverse colors, i.e., do for your plotly color sequence [::-1]
                color_discrete_sequence=get_custom_colorscale(px.colors.sequential.RdBu[::-1], len(sorted_cols)),
                markers = True
                
                )
    f.add_annotation(x='1970', y=40, text="High concern")
    f.add_annotation(x='1970', y=80, text="Low concern")
    # prevent from changing the scale when turning off/on locators
    f.update_yaxes(range = [-5, 100])
    
    return add_WQI_lines(f)

yearly_WQI_all_fig(df_an).show()

As you can see, the plot is difficult to read and raises many concerns. We can see that LSIN1 has recently had a downward spike in WQI, although in the past it had high WQI values. The figure also shows interruptions and resumptions of service at some locations.

Section Stations Locations gave us some overview regarding the areal coverage and timespan of the recorded WQIs.

The data we have shows interruptions, additions, as well as resumptions of stations servicing the area. This analysis focuses on the general status of the region rather than on individual cases, so it is important to determine how many records are available per station before performing further analysis.

The following section deals with this issue.

Stations' Selection¶

Let's take a look at how many reports we should expect if locators (another word for the stations measuring WQI) had operated for the entire available period.

In [ ]:
md(f"If we consider all available years, i.e., {YEAR_MIN} - {YEAR_MAX}, we "
   f"should have {YEAR_MAX-YEAR_MIN+1} records per station.")
Out[ ]:

If we consider all available years, i.e., 1970 - 2023, we should have 54 records per station.

The histogram below shows the frequency of available records with respect to locators.

In [ ]:
"""Get the count of non-null values in dataframe 
    @param df_ (in) : data frame to be examined
    @return a series with frequencies sorted in a descending manner
"""
def get_and_sort_freqs(df_):
    # get the number of counts of non-null values in, sorting
    # to get extra info later
    return df_.count().sort_values(ascending = False)

rec_freq = get_and_sort_freqs(df_an)

""" Shows the frequency plot for series_
    @param series_ What we want to present
    @param title_ The title to be shown

    @return the plot to be shown
"""
def plot_freqs(series_, title_):
    return px.histogram(series_, 
        title=title_,
        labels = { 'value': '# of Available Records'},
        text_auto = True,
        marginal='box'
    ).update_layout(showlegend = False)

plot_freqs(rec_freq, f"Frequency of available records for timeframe {YEAR_MIN}-{YEAR_MAX} with respect to locators.").show()
In [ ]:
#print(rec_freq.index[0])
#print(my_locator)
my_locator = loc_df.loc[loc_df['Locator'] == rec_freq.index[0], 'SiteName']
#print(my_locator.to_string(index = False))
md(f"There is only one station with all records available, i.e., {rec_freq.iloc[0]}, "
   f"and it is *{my_locator.to_string(index = False)}*, {rec_freq.index[0]}. "
   "The median is 34 records.")
Out[ ]:

There is only one station with all records available, i.e., 54, and it is Piper's Creek, KTHA01. The median is 34 records.

In [ ]:
rec_freq
Out[ ]:
KTHA01           54
3106             53
430              51
474              51
470              51
B319             50
434              50
478              50
311              49
322              49
631              48
446              48
486              48
440              48
D320             48
317              48
A315             47
484              47
G320             46
C320             46
A320             46
B484             46
632              46
C484             45
442              45
N484             45
A631             45
A319             45
444              44
A438             44
A432             44
A499             42
A456             42
S484             38
KSHZ06           37
J484             36
C370             36
321              36
D474             34
KTHA03           34
C446             33
KTHA02           32
A680             31
438              30
F321             29
A620             29
A690             26
A617             26
A685             25
X630             24
LSIN1            23
LSIN9            22
BB470            20
S478             18
VA42A            17
VA12A            16
VA41A            16
VA45A            16
0450CC           16
CHERRY_1         16
HARRIS_1         15
VA65A            15
AMES_1           15
PATTER_3         15
MFK_SNQ          13
TOLT_MTH         13
SNQDUVALL        13
SKYKOMISH        13
SFK_SNQ          13
BSE_1MUDMTNRD    13
RAGING_MTH       13
NFK_SNQ          13
GRIFFIN          13
VA37A            12
A670             11
B499              7
A687              5
0484A             3
dtype: int64

Now it is time to decide which stations to take into account for our analysis. This choice determines how many records will be available at least per station, and which years in the recorded history may be missing. The table below summarizes my findings, based on which I decided to go with Group B: it ensures that at most 15% of records are missing, i.e., at least 46 records are available per station, and the missing records can include years $\le 1978$ and/or [2010; 2014] inclusive. Skip to Section WQI Analysis if you want to go straight to the WQI analysis.

| Group | No. of Avail Recs ($\ge$) | No. of Missing Recs ($\le$) | % Missing ($\le$) | No. of Stations | Area Covered |
| --- | --- | --- | --- | --- | --- |
| A | 49 | 5 | 9% | 10 | Northern, Tukwila, south of Maple Valley |
| B | 46 | 8 | 15% | 23 | Eastern, central, southern |
| C | 43 | 11 | 20% | 31 | Eastern, central |

Note that $A \subseteq B \subseteq C$, as group $A$ has the strictest constraints. Group B covers the area covered by group A and, in addition, more of the eastern, central, and southern parts of the region (similarly, group C extends group B).
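The nesting can be sanity-checked directly: thresholding the same frequency series at a stricter cutoff can only shrink the result. A minimal sketch with made-up counts (the station names and values below are illustrative, not the real data):

```python
import pandas as pd

# toy frequency series: available records per station (illustrative values)
rec_freq = pd.Series({"S1": 54, "S2": 50, "S3": 47, "S4": 45, "S5": 43})

def stations_with_at_least(n, freqs):
    # stations whose record count meets the threshold
    return set(freqs.index[freqs >= n])

grp_a = stations_with_at_least(49, rec_freq)  # strictest cutoff
grp_b = stations_with_at_least(46, rec_freq)
grp_c = stations_with_at_least(43, rec_freq)  # most relaxed cutoff

# a stricter cutoff always yields a subset of a looser one
assert grp_a <= grp_b <= grp_c
```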

Group A requires at least 49 records, i.e., at most 9% of records missing, and gives 10 stations. They cover three regions: the northern part of the region, two stations at Tukwila, and south of Maple Valley at the Green River. Their missing records, if any, are from 1978 and/or prior. One station, in addition, misses the records for [2010; 2014] (inclusive).

If we consider stations with at least 46 records available, i.e., at most 15% of records missing, the number of stations increases to 23, and the coverage includes additional stations in the eastern, central, and southern parts of the region. Apart from records missing for years $\le 1978$, there are now 13 stations missing at least one record within [2010; 2014] (inclusive), and one station also missing the reports for [2022; 2023].

If we relax the constraint to at least 43 records per station, i.e., $\le$ 20% missing, we get 31 stations that cover even more of the central part and a little more of the eastern part.

In [ ]:
# get the stations with the largest number of records available

"""
    Get the locators with at least rec_count records available
    @param rec_count how many records do we want to have
    @param freqs the frequencies series, I assume the 
                   freqs is a series with a decreasing order
    @return list of locators with at least rec_count
"""
def get_locators_with_at_least(rec_count, freqs):    
    return [x for x in freqs.index if freqs[x] >= rec_count]

"""
    Get the years that are missing
 @param loc_list The list of the locators to be checked.
 @param dat_fr  The data frame to be checked
 @return the list of missing years against dat_fr
"""
def get_missing_years(loc_list, dat_fr):
    res = []
    for ll in loc_list:
        ss = pd.isnull(dat_fr[ll])
        years = ss[ss == True].index.to_list()
        if (len(years) != 0):
            res.append([ll, years])
            #print(f"{ll}: {years}")
    return res

""" 
    print locators and missing years and returns the frame of locators

    @param rec_count in: the least number of records
    @param freqs in: the frequencies series, I assume the freqs is a series
                 in a decreasing order
    @param dat_fr in: the data frame to be used
    @ret   locs - a list of stations that have at least rec_count records available
"""                  
def print_locs_and_missing_years(rec_count, freqs, dat_fr):
    locs = get_locators_with_at_least(rec_count, freqs)
    total_rec = YEAR_MAX - YEAR_MIN + 1
    perc_missing_no_of_rec = 100 - round((rec_count / total_rec) * 100)
    missing_no = total_rec - rec_count
    print(f"---------\nRecords avail. >= rec_count={rec_count}, "
          f"<= {perc_missing_no_of_rec}% missing, "
          f"missing records (at most) {missing_no}\n"
          f"locators={locs}\nlen(locs) = {len(locs)}")
    print(f"\nMissing years:")
    missing_years = get_missing_years(locs, dat_fr)
    # print a list of lists with linebreaks
    print('\n'.join(map(str, missing_years)))

    return locs   


"""
    Prepare the density mapbox figure for a location
    @param dat_fr (in) a dataframe showing the locator ids satisfying
                       a frequency condition
    @param location_years_df (in) dataframe showing the years for a given locator
    @ret density_mapbox figure
"""
def get_fig_density(dat_fr, location_years_df, descr=""):
    # get the location of locators
    l_df = location_years_df.loc[location_years_df['Locator'].isin(dat_fr)]
    
    f = px.density_mapbox(l_df, lat = "lat", lon = "lng", radius = 10, 
                          center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, 
                          hover_name = "SiteName", hover_data = "Locator",
                          mapbox_style = MAP_STYLE, title = descr,
                          width = 800, height = 900)
    return f


"""Show the density mapbox for a location
    @param dat_fr (in) a dataframe showing the locator ids satisfying
                       a frequency condition
    @param location_years_df (in) dataframe showing the years for a given locator
"""
def show_density(dat_fr, location_years_df, descr=""):    
    f = get_fig_density(dat_fr, location_years_df, descr)
    f.show()


""" Show data that will help to determine which stations to select 
    for further analysis: print some descriptive stats and 
    plot figures.

    @param freqs in: the frequencies series, assumed to be sorted
                 in a decreasing order
    @param location_years_df in: dataframe with the years and location
                 per locator
    @param dat_fr in: the data frame to be used

"""
def do_loc_analysis(freqs, location_years_df, dat_fr):
    # thresholds for the minimum number of available records per station
    avail_years = [49, 46, 43]

    figs = []
    for y in avail_years:
        fr = print_locs_and_missing_years(y, freqs, dat_fr)
        print()
        figs.append(get_fig_density(fr, location_years_df, 
                     f"Stations with at least {y} records available for "
                     f"{YEAR_MIN}-{YEAR_MAX}. Stations count = {len(fr)}"))
        # show_density(fr, location_years_df, 
        #             f"Stations with at least {y} records available for {YEAR_MIN}-{YEAR_MAX}.")
        
    [f.show() for f in figs]

do_loc_analysis(rec_freq, loc_year_df, df_an)
---------
Records avail. >= rec_count=49, <= 9% missing, missing records (at most) 5
locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322']
len(locs) = 10

Missing years:
['3106', [1970]]
['430', [1970, 1971, 1978]]
['474', [1970, 1971, 1978]]
['470', [1970, 1971, 1978]]
['B319', [1970, 1971, 1974, 1975]]
['434', [1970, 1971, 1973, 1978]]
['478', [1970, 1971, 1975, 1978]]
['311', [2010, 2011, 2012, 2013, 2014]]
['322', [1970, 1971, 1974, 1975, 1976]]

---------
Records avail. >= rec_count=46, <= 15% missing, missing records (at most) 8
locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322', '631', '446', '486', '440', 'D320', '317', 'A315', '484', 'G320', 'C320', 'A320', 'B484', '632']
len(locs) = 23

Missing years:
['3106', [1970]]
['430', [1970, 1971, 1978]]
['474', [1970, 1971, 1978]]
['470', [1970, 1971, 1978]]
['B319', [1970, 1971, 1974, 1975]]
['434', [1970, 1971, 1973, 1978]]
['478', [1970, 1971, 1975, 1978]]
['311', [2010, 2011, 2012, 2013, 2014]]
['322', [1970, 1971, 1974, 1975, 1976]]
['631', [1970, 1971, 1973, 1974, 1977, 1978]]
['446', [1970, 1971, 1972, 1973, 1974, 1978]]
['486', [1970, 1971, 1972, 1973, 1974, 1975]]
['440', [1970, 1971, 1973, 1974, 1975, 1976]]
['D320', [1970, 1971, 1973, 1974, 1975, 1976]]
['317', [1970, 1971, 1972, 1974, 1975, 1976]]
['A315', [1970, 1971, 1972, 1973, 1974, 1975, 1976]]
['484', [1970, 1971, 1973, 1975, 1978, 2022, 2023]]
['G320', [1970, 1971, 1974, 1975, 1976, 2010, 2011, 2012]]
['C320', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011]]
['A320', [1970, 1971, 1973, 1974, 1975, 1976, 1977, 1978]]
['B484', [1970, 1971, 1974, 1975, 1978, 2010, 2011, 2012]]
['632', [1970, 1971, 1973, 1974, 1977, 1978, 2012, 2014]]

---------
Records avail. >= rec_count=43, <= 20% missing, missing records (at most) 11
locators=['KTHA01', '3106', '430', '474', '470', 'B319', '434', '478', '311', '322', '631', '446', '486', '440', 'D320', '317', 'A315', '484', 'G320', 'C320', 'A320', 'B484', '632', 'C484', '442', 'N484', 'A631', 'A319', '444', 'A438', 'A432']
len(locs) = 31

Missing years:
['3106', [1970]]
['430', [1970, 1971, 1978]]
['474', [1970, 1971, 1978]]
['470', [1970, 1971, 1978]]
['B319', [1970, 1971, 1974, 1975]]
['434', [1970, 1971, 1973, 1978]]
['478', [1970, 1971, 1975, 1978]]
['311', [2010, 2011, 2012, 2013, 2014]]
['322', [1970, 1971, 1974, 1975, 1976]]
['631', [1970, 1971, 1973, 1974, 1977, 1978]]
['446', [1970, 1971, 1972, 1973, 1974, 1978]]
['486', [1970, 1971, 1972, 1973, 1974, 1975]]
['440', [1970, 1971, 1973, 1974, 1975, 1976]]
['D320', [1970, 1971, 1973, 1974, 1975, 1976]]
['317', [1970, 1971, 1972, 1974, 1975, 1976]]
['A315', [1970, 1971, 1972, 1973, 1974, 1975, 1976]]
['484', [1970, 1971, 1973, 1975, 1978, 2022, 2023]]
['G320', [1970, 1971, 1974, 1975, 1976, 2010, 2011, 2012]]
['C320', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011]]
['A320', [1970, 1971, 1973, 1974, 1975, 1976, 1977, 1978]]
['B484', [1970, 1971, 1974, 1975, 1978, 2010, 2011, 2012]]
['632', [1970, 1971, 1973, 1974, 1977, 1978, 2012, 2014]]
['C484', [1970, 1971, 1972, 1973, 1975, 1978, 2010, 2011, 2012]]
['442', [1970, 1971, 1973, 1974, 1975, 1976, 2010, 2011, 2012]]
['N484', [1970, 1971, 1972, 1973, 1975, 1978, 2010, 2011, 2012]]
['A631', [1970, 1971, 1973, 1974, 1977, 1978, 2010, 2011, 2012]]
['A319', [1970, 1971, 1973, 1974, 1975, 2010, 2011, 2013, 2014]]
['444', [1970, 1971, 1972, 1973, 1974, 1975, 1976, 2010, 2011, 2012]]
['A438', [1970, 1971, 1973, 1974, 1975, 2010, 2011, 2012, 2013, 2014]]
['A432', [1970, 1971, 1972, 1973, 1974, 1975, 1978, 2010, 2011, 2012]]

WQI Analysis¶

In order to analyze how WQI has changed over the years, I have chosen Group B, explained in Section Stations' Selection. In summary: there are 23 stations, and each station misses at most 15% of possible records, i.e., 8, for the period 1970-2023 (for various reasons: a station might not have existed at a given time, or its service was interrupted). The missing records can include years $\le 1978$ and/or [2010; 2014] inclusive. The area covers stations located in the eastern, central, northern, and southern parts of the region.
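The selection itself boils down to keeping the columns with enough non-null entries. A minimal, self-contained sketch of the idea (tiny synthetic frame; the real analysis applies the same filter to `df_an` with a threshold of 46):

```python
import numpy as np
import pandas as pd

# years as index, stations as columns; NaN marks a missing annual record
df = pd.DataFrame({
    "A": [70.0, 65.0, 60.0, 62.0],
    "B": [np.nan, 55.0, 50.0, np.nan],
    "C": [80.0, np.nan, 75.0, 78.0],
}, index=[2020, 2021, 2022, 2023])

# keep stations with at least 3 available records;
# df.count() counts non-null values per column
selected = df.loc[:, df.count() >= 3]
```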

In [ ]:
# df_an holds the column as locators, index as years and values as WQI
df_an.head()
Out[ ]:
0450CC 0484A 3106 311 317 321 322 430 434 438 440 442 444 446 470 474 478 484 486 631 632 A315 A319 A320 A432 A438 A456 A499 A617 A620 A631 A670 A680 A685 A687 A690 AMES_1 B319 B484 B499 BB470 BSE_1MUDMTNRD C320 C370 C446 C484 CHERRY_1 D320 D474 F321 G320 GRIFFIN HARRIS_1 J484 KSHZ06 KTHA01 KTHA02 KTHA03 LSIN1 LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 RAGING_MTH S478 S484 SFK_SNQ SKYKOMISH SNQDUVALL TOLT_MTH VA12A VA37A VA41A VA42A VA45A VA65A X630
WaterYear
1970 NaN NaN NaN 70.93 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37.89 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1971 NaN NaN 48.31 61.14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.36 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1972 NaN NaN 67.97 74.90 NaN 83.43 77.38 80.32 84.46 88.33 89.09 95.55 NaN NaN 86.95 87.80 91.22 90.80 NaN 89.5 88.82 NaN 79.75 89.12 NaN 93.45 NaN NaN NaN NaN 83.5 NaN NaN NaN NaN NaN NaN 71.03 87.38 NaN NaN NaN 92.2 NaN NaN NaN NaN 91.42 NaN NaN 84.40 NaN NaN NaN NaN 47.24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1973 NaN NaN 67.19 75.38 92.03 84.14 60.90 61.22 NaN NaN NaN NaN NaN NaN 66.29 55.99 62.88 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 85.98 54.10 NaN NaN NaN NaN NaN NaN NaN NaN NaN 49.08 NaN 67.46 NaN NaN NaN NaN 51.35 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1974 NaN NaN 87.43 83.90 NaN NaN NaN 37.59 41.80 NaN NaN NaN NaN NaN 39.42 30.67 38.24 36.28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 51.15 NaN NaN 32.02 NaN NaN NaN NaN 63.2 NaN 63.21 NaN NaN NaN NaN NaN 64.63 NaN NaN NaN 54.61 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [ ]:
""" Get the group as a dataframe with at least at_least_rec records available
    at_least_rec in: (int) the minimum number of records available at a station
    freqs in: series with the frequency of records per station
    dat_fr in: the main data frame with all records

"""
def get_gr_df(at_least_rec, freqs, dat_fr):    
    # get the list list of locators/stations with at least
    gr_list = get_locators_with_at_least(at_least_rec, freqs)
    
    # this is the list of group C WQI values
    wqi_gr_df = dat_fr[gr_list]

    #print(gr_list)
    #print(dat_fr[gr_list])
    #len(df_an[gr_list].columns)

    return wqi_gr_df

# the selected group's dataframe (Group B in the summary table;
# the code keeps the 'C'/'c' naming in variables)
# the minimum number of records per station
C_GRP_COUNT = 46
grp_c_df = get_gr_df(C_GRP_COUNT, rec_freq, df_an)

#type(grp_c_df)
#print(f"grp_c_df.index={grp_c_df.index}")

#l_df = loc_year_df.loc[loc_year_df['Locator'].isin(gr_c_list)]
#print(l_df)
#print(l_df[l_df["Locator"] == 'KTHA01'])
#df_an['KTHA01']

Let's see how, in general, WQI performed over the years: what the min, max, and median were.

In [ ]:
# get grp C stats
grp_c_stats_df = crt_stat_df(grp_c_df).reset_index()
grp_c_stats_df = grp_c_stats_df.rename(columns= {'index' : 'Locator'})
grp_c_stats_df.index.name = 'index'
grp_c_stats_df
Out[ ]:
Locator min max median
index
0 B319 58.92 94.74 85.250
1 D320 73.35 95.11 84.845
2 C320 68.45 94.93 83.760
3 A320 27.19 92.96 78.475
4 440 27.99 91.89 68.040
5 G320 43.09 95.15 65.945
6 631 33.77 89.50 64.695
7 486 40.15 86.74 61.290
8 311 33.84 86.58 61.140
9 478 20.69 91.22 58.675
10 632 32.25 88.82 58.050
11 KTHA01 23.36 82.78 57.835
12 470 28.03 89.83 57.450
13 B484 22.78 87.38 57.315
14 3106 26.74 87.43 55.100
15 484 7.82 90.80 53.660
16 446 8.15 81.37 47.735
17 430 13.00 88.72 47.200
18 474 13.42 87.80 45.720
19 322 4.49 80.71 39.330
20 434 12.18 84.46 38.820
21 A315 1.00 96.14 37.420
22 317 1.00 92.03 20.260
In [ ]:
"""Show the group's statistics with a box plot
    @param grp_df_ : in : the dataframe that contains the WQI values
    @param rec_count : in : the minimum number of records per station in the group
    @param period : in : the period description shown in the title

"""
def plot_grp_stats(grp_df_, rec_count, period, tickangle_ = 0):
    #sorted_df.grp_df_.median(axis = 0).sort_values()
    #print(sorted_df)

    sorted_df = grp_df_.median(axis = 0).sort_values(ascending = False)
    #print(sorted_df.index)
    #print(f"grp_df_=\n{grp_df_.head()}")

    fig =  px.box(grp_df_, y = sorted_df.index, points='all',
                  title=f"Descriptive stats for stations in King County, WA, with at least {rec_count}"
                  f" records for the period {period}.",
                  labels = { "value" : "WQI", 
                            "variable" : "Stations IDs (Locators)"}
                  ) 
    fig.add_hline(y=40, line_dash="dash", line_color="red")
    fig.add_annotation(x='D320', y=40, text="High concern")
    fig.add_hline(y=80, line_dash="dash", line_color="green")
    fig.add_annotation(x='D320', y=80, text="Low concern", ay=50)
    fig.update_xaxes(tickangle = tickangle_)
    return fig

plot_grp_stats(grp_c_df, C_GRP_COUNT, f"{YEAR_MIN}-{YEAR_MAX}; {len(grp_c_df.columns)} stations reporting").show()

As we can see, only three stations have a median above the Low concern level, and four fall into the High concern category, including a median of 20.26 for Springbrook Creek mouth at SW 16th St close to Renton (Locator 317). The majority, i.e., 16 stations (about 70%), reported medians within the Moderate concern range. This poses the question of where the stations are located. Does WQI depend on the geographical location? The following viz attempts to shed some light on it.
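The map below classifies stations by binning their median WQI with `pd.cut`. Note that `pd.cut` is right-inclusive by default, so matching the stated boundaries [0; 40), [40; 80), [80; ...) requires `right=False`. A quick check of the boundary behavior (labels shortened for brevity):

```python
import numpy as np
import pandas as pd

medians = pd.Series([20.0, 40.0, 79.9, 80.0, 95.0])
bins = [0, 40, 80, np.inf]
labels = ["high", "moderate", "low"]

# right=False makes the intervals left-inclusive: [0, 40), [40, 80), [80, inf)
classes = pd.cut(medians, bins, labels=labels, right=False)
assert list(classes) == ["high", "moderate", "moderate", "low", "low"]
```

With the default `right=True`, a median of exactly 40 would land in the (0, 40] bin and be classified as high concern instead.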

In [ ]:
"""
Prepares the figure based on several dataframes
    @param df_ The main WQI frame 
    @param location_df The dataframe with stations' location
    @param wqi_s_ The series such as median(), max(), min()
    @param col_name The name of the new column that describes the series
                    wqi_s_
    @param descr Title for the figure
    @ret figure
"""
def get_fig_density_scatter(df_, location_df, wqi_s_, col_name_, descr=""):
    # get the location of locators
    l_df = location_df.loc[location_df['Locator'].isin(df_)].reset_index()
    l_df.index.name = 'index'

    # so your index starts with 0, 1, 2, 
    wqi_df = wqi_s_.to_frame(name=col_name_).reset_index()
    wqi_df = wqi_df.rename(columns= {'index' : 'Locator', 0 : col_name_})
    # otherwise the name of the index is None
    wqi_df.index.name = 'index'
    
    # merge the dataframe with wqi values into a dataframe having
    # the location, stations names and stations' ids so we can 
    # nicely plot the clean locations
    l_df = l_df.merge(wqi_df[['Locator', col_name_]], on = 'Locator')
    print(f"l_df={l_df[['SiteName', 'Locator', col_name_]].sort_values(col_name_,  ascending = False)}")
    #print(f"wqi_df={wqi_df}")
    # stratify it because the trend is difficult to observe otherwise
    bins = [0, 40, 80, np.inf]
    # [0;40), [40;80), [80; ...), and since I can't find 
    # how to change a legend, so I have to do it with the long names now
    # the dataframe is not too big anyway, so adding extra bytes ...

    names = ['<40 high concern', '[40; 80) moderate concern', '>=80 low concern']
    class_name = 'WQI_class'
    # cut with left-inclusive, right-exclusive bins; pd.cut is
    # right-inclusive by default, hence right=False
    l_df[class_name] = pd.cut(l_df[col_name_], bins, labels=names, right=False)
    #print(l_df)

    # this shows the continuous scale but it does not reveal much,
    # so I decided to go with scatter_mapbox;
    # uncomment if you want to see the continuous scale
    #f = px.density_mapbox(l_df, lat = "lat", lon = "lng", radius = 10, 
    #                      center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, 
    #                      hover_name = "SiteName", hover_data = ["Locator", col_name_],
    #                      mapbox_style = MAP_STYLE, title = descr,
    #                      z = col_name_,
    #                      color_continuous_scale='RdBu',
    #                     width = 800, height = 900)
    
    f = px.scatter_mapbox(l_df, lat = "lat", lon = "lng", 
                          center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG), zoom = 8, 
                          hover_name = "SiteName", hover_data = ["Locator", col_name_, class_name],
                          mapbox_style=MAP_STYLE, title = descr,
                          size = col_name_,
                          size_max = 20,
                          color = class_name,                          
                          #labels = { "legend" : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"} },
                          #legend = {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"},
                          color_discrete_map = {
                              names[0] : "red",
                              names[1]  : "yellow",
                              names[2]  : "blue"
                          },
                          category_orders = {
                              class_name : [names[2], names[1], names[0]]
                          },                          
                         width = 800, height = 900)
    #f.update_layout(legend={ labels : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"}})
    f.update_layout(legend_title_text="WQI class")
    
    return f

get_fig_density_scatter(grp_c_df,loc_df, grp_c_df.median(axis = 0),
                        "WQI_median",
                        f"Stations in King County, WA, with at least {C_GRP_COUNT} records for "
                        f"the period {YEAR_MIN}-{YEAR_MAX}.").show()
l_df=                                         SiteName Locator  WQI_median
17                    Green River at 212th Way SE    B319      85.250
20      Jenkins Creek at Kent Black Diamond Rd SE    D320      84.845
19  Covington Creek at SE Auburn-Black Diamond Rd    C320      83.760
16     Soos Creek mouth below Soos Creek Hatchery    A320      78.475
5       May Creek mouth at Lake Washington Blvd N     440      68.040
21               Little Soos Creek at SE 272nd St    G320      65.945
12        Issaquah Creek mouth at NW Sammamish Rd     631      64.695
11             Sammamish River at NE Marymoor Way     486      61.290
0                       Green River at Interurban     311      61.140
9              Little Bear mouth near NE 178th St     478      58.675
13                   Issaquah North Fork at mouth     632      58.050
22                                  Piper's Creek  KTHA01      57.835
7             Swamp Creek mouth at Bothell Way NE     470      57.450
18                Evans Creek at NE Union Hill Rd    B484      57.315
14                    Green River at Starfire Way    3106      55.100
10           Bear Creek downstream of Redmond Way     484      53.660
6            Juanita Creek mouth at NE Juanita Dr     446      47.735
3              Lyon Creek mouth at Bothell Way NE     430      47.200
8      North Creek mouth at Sammamish River Trail     474      45.720
2            Newaukum Creek mouth at 212th Way SE     322      39.330
4       Thornton Creek mouth at Sand Point Way NE     434      38.820
15      Mill Creek near 68th Ave S and S 262nd St    A315      37.420
1           Springbrook Creek mouth at SW 16th St     317      20.260

We can see that the Low concern stations are only in the south-eastern part of the region, close to the Covington area:

| Station | Locator | Median WQI |
| --- | --- | --- |
| Green River at 212th Way SE | B319 | 85.250 |
| Jenkins Creek at Kent Black Diamond Rd SE | D320 | 84.845 |
| Covington Creek at SE Auburn-Black Diamond Rd | C320 | 83.760 |

High concern stations are located in the northern and southern parts of the area. Surprisingly, one High concern station, namely Newaukum Creek mouth at 212th Way SE (322), with a median WQI of 39.33, is in the vicinity of the Low concern stations. The lowest median in the region belongs to Springbrook Creek mouth at SW 16th St (317), the station close to Renton, with a median WQI of 20.26.

| Station | Locator | Median WQI |
| --- | --- | --- |
| Newaukum Creek mouth at 212th Way SE | 322 | 39.330 |
| Thornton Creek mouth at Sand Point Way NE | 434 | 38.820 |
| Mill Creek near 68th Ave S and S 262nd St | A315 | 37.420 |
| Springbrook Creek mouth at SW 16th St | 317 | 20.260 |

The figure shows that the worst WQI in terms of median values was reported at stations 317, A315, 434, and 322. All of them but 434 appear to be small creeks. In contrast, only one of the three best stations, B319, measures the water quality of a big river, the Green River; the map does not show any creeks for the other two.

It would be interesting to see how WQI changed over the years. Do changes depend on the year?

In [ ]:
# how we name strata with respect to WQI
CONCERN_STRATA = {"l" : "Low concern", "m" : "Moderate concern", "h" : "High concern" }
In [ ]:
""" Stratifies the frame according to a median column into
    three classes: [80; ...), [40;80), [0;40) and names
    it according to CONCERN_STRATA
    
    @param df_ (in): a dataframe that has a column named 'median'
    @return The dataframe with a stratified column named WQI_class
            and values h, m, l depending on the median
"""
def get_stratified_median_df(df_):
    l_df = df_
    bins = [0, 40, 80, np.inf]
    # [0;40), [40;80), [80; ...)    
    names = [CONCERN_STRATA["h"],CONCERN_STRATA["m"], CONCERN_STRATA["l"]]
    class_name = 'WQI_class'
    
    # cut with left-inclusive, right-exclusive bins; pd.cut is
    # right-inclusive by default, hence right=False
    l_df[class_name] = pd.cut(l_df['median'], bins, labels=names, right=False)

    return l_df

#get_stratified_median_df(grp_c_stats_df)
In [ ]:
# Dash provides the interactive checklist used below
from dash import Dash, dcc, html, Input, Output

"""Plots yearly WQI and makes it easier to select locators by
    group selection and deselection
    @param df_ the group that you want to work with, should have years and WQI
               grouped by Locators as columns
    @param stratified_df_ a supporting dataframe with a column that
               classifies the locators per a given metric, such as median,
               into groups that can be selected or deselected.
    @param title_ The title for the figure
    @return the Dash app
"""
def plot_yearly_WQI(df_, stratified_df_, title_):
    app = Dash(__name__)

    # we don't need 'min' and 'max' column
    stratified_df_ = stratified_df_.drop(['min', 'max'], axis=1)
    # get the reverse colors, i.e., do for your plotly color sequence [::-1]
    # and we don't want to make that color scale adjust every time within
    # that palette; I always want to have bad WQIs in the same red and
    # good WQIs in the same blue
    stratified_df_['Colorscale'] = get_custom_colorscale(
        px.colors.sequential.RdBu[::-1], len(stratified_df_.index))
    #print(stratified_df_)  
    #print(df_.head())
        
    app.layout = html.Div([
        #html.H4("AnnualScore WQI for all locators in Kings County, WA,"
        #        f" with at least {C_GRP_COUNT} records available"
        #        f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
        #        "descending order according to stations' WQI median."),        
        dcc.Graph(id="graph"),
        dcc.Checklist(
            id="checklist",
            options=[CONCERN_STRATA["l"], CONCERN_STRATA["m"], CONCERN_STRATA["h"]],
            value=[CONCERN_STRATA["l"], CONCERN_STRATA["m"], CONCERN_STRATA["h"]],
            inline=True, 
        ),
    ])

    @app.callback(
        Output("graph", "figure"), 
        Input("checklist", "value"))
    
    def update_line_chart(wqi):
        
        # check which locators qualify for the choice
        mask = stratified_df_.WQI_class.isin(wqi)
        tmp_df = stratified_df_[mask]   
        qualified_locators = tmp_df['Locator'].to_list()
        colorscale = tmp_df['Colorscale'].to_list()
        
        f = px.line(df_, x = df_.index, y = qualified_locators, 
                labels = {"value" : "WQI",
                          "variable" : "Station ID"},
                title = title_,
                height = 800,                
                color_discrete_sequence=colorscale,
                markers = True,
                template = "plotly_dark"
                )
        f.update_yaxes(range = [-5, 100])
        f.add_annotation(x=df_.index[0], y=40, text="High concern")
        f.add_annotation(x=df_.index[0], y=80, text="Low concern")
    
        return add_WQI_lines(f)

    return app

plot_yearly_WQI(grp_c_df, get_stratified_median_df(grp_c_stats_df),
                "AnnualScore WQI for all locators in King County, WA,"
                f" with at least {C_GRP_COUNT} records"
                f" for years {YEAR_MIN}-{YEAR_MAX}. The legend is sorted in a "
                "descending order according to stations' WQI median. "
                f"Total {len(grp_c_df.columns)} stations.").run_server(debug=True)

A few observations related to the historical values of WQI:

  • Low concern: in the history of B319 we can see several dips toward WQI $\approx$ 60 in 1993 and 1994, and more recently in 2007, when WQI = 67.

  • High concern: all locators show dramatic improvement in 2023 compared to 2022, $\Delta$ WQI $\approx$ [17, 30]. This might be because reporting for 2023 covers only part of the year. This group of locators seems to be on a recovery track; the worst times seem to be behind them:

    Locator   Years of extremely low WQI   "Hill" years
    322       1990-2002                    2010-2017
    434       1986-2007                    -
    A315      1980-2002                    2007-2015
    317       1977-1992                    1992-2002

All of those stations apart from 434 show an interesting hill, i.e., an increase followed by a decrease, albeit in different timeframes. These locators recorded a relatively high WQI (A315, WQI $\approx$ 96 in 2010; 322, WQI $\approx$ 78 in 2012) and then a big decrease in water quality.

  • Moderate concern: in general, these locators stay mostly within the moderate band over the years, occasionally crossing into both the Low concern and High concern zones.
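
The rise-then-fall "hill" pattern described above can be checked programmatically. Below is a minimal sketch, not part of the notebook's pipeline: the 20-point drop threshold and the toy values are illustrative assumptions (the toy series is shaped like A315's 2007-2015 hill).

```python
import pandas as pd

def find_hill(wqi_by_year, min_drop=20.0):
    """Return (peak_year, peak_WQI) if the series rises to a peak and then
    falls by at least min_drop afterwards; otherwise return None.
    The min_drop threshold is an illustrative assumption."""
    s = wqi_by_year.dropna()
    peak_year = s.idxmax()
    after = s.loc[peak_year:]          # the peak year and everything after it
    if len(after) < 2:
        return None                    # peak at the end: no fall observed yet
    drop = s.loc[peak_year] - after.min()
    return (int(peak_year), float(s.loc[peak_year])) if drop >= min_drop else None

# toy series shaped like A315's 2007-2015 hill (peak WQI ~96 in 2010)
hill = find_hill(pd.Series({2007: 34.6, 2010: 96.1, 2012: 86.9, 2015: 32.5}))
print(hill)  # -> (2010, 96.1)
```

A series that only rises (no fall after the peak) yields `None`, matching the "no hill" case of station 434.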

To understand how WQI has been shaped region-wide, it would be interesting to compare how the median for the entire region changed over the years.

In [ ]:
"""get sorted columns according to ascending WQI for a list of locators
    @param df_ (in): the frame that contains years and WQIs
    @param loc_list_ (in): the list of locators we are interested in
    @return dataframe with columns corresponding to WQIs and corresponding
                      years the WQI was reported, WQI sorted in ascending order

"""
def get_years_min_vals(df_, loc_list_):
    min_df = pd.DataFrame()
    for l in loc_list_:
        df = df_[l]
        df = df.sort_values().reset_index()
        #print(df)
        names = {"y" : f"{l}_year", "l" : f"{l}_WQI"}
        min_df[names["l"]] = df[l]
        min_df[names["y"]] = df['WaterYear']
        

    return min_df

get_years_min_vals(grp_c_df, ['322', '434', 'A315', '317'])
Out[ ]:
322_WQI 322_year 434_WQI 434_year A315_WQI A315_year 317_WQI 317_year
0 4.49 1991 12.18 2007 1.00 1980 1.00 1990
1 9.89 1996 15.88 1986 2.91 1979 1.00 1977
2 12.66 1998 17.56 1992 4.45 1984 1.00 1979
3 15.14 1990 18.26 1999 8.54 1991 1.00 1983
4 19.87 2000 20.21 2004 8.63 1989 1.00 1984
5 20.08 1984 21.53 2006 9.70 1998 1.19 1985
6 23.69 2017 22.94 1976 9.79 1985 1.40 1988
7 24.90 2002 23.54 1997 10.34 1983 2.36 1986
8 26.19 1983 23.62 2008 10.46 1990 4.13 1992
9 26.91 1993 25.87 1998 11.87 1981 4.25 1987
10 28.42 1997 27.19 1981 13.81 1978 4.42 1980
11 28.90 1995 27.28 2003 14.04 2002 5.26 1989
12 30.35 2001 27.38 1996 15.01 1986 6.66 2004
13 31.25 1999 27.83 1993 15.27 1987 6.68 2002
14 33.65 1988 31.26 1995 19.47 1997 7.03 1991
15 34.12 1979 31.37 2005 21.70 1988 7.98 2006
16 34.13 2007 31.54 1980 22.82 1982 9.43 1982
17 34.45 2019 33.16 2015 23.61 1999 11.07 1981
18 35.40 2010 33.75 2000 25.77 2001 12.32 2001
19 36.14 1978 34.36 2010 27.04 1992 15.88 2000
20 36.26 1986 34.41 1979 32.05 2000 16.71 2008
21 36.43 1994 35.07 2002 32.50 2015 18.46 2003
22 38.58 2016 35.37 1988 34.59 2007 18.84 2007
23 38.80 2021 37.42 2009 37.42 2004 19.56 1998
24 39.33 2009 38.82 1994 39.72 1995 20.96 1999
25 39.43 2005 38.82 2014 42.14 2005 21.16 2005
26 39.80 1989 39.06 1990 43.89 2003 21.43 2009
27 39.87 2015 39.68 1984 44.04 1994 25.09 2020
28 41.30 2006 40.72 1975 44.78 2022 26.63 2013
29 41.40 1980 41.01 2018 46.00 1996 26.65 2021
30 41.69 1981 41.80 1974 46.79 2008 26.68 1997
31 41.81 1987 42.46 2001 46.79 2006 26.69 2015
32 42.53 2014 44.29 2019 47.12 2019 27.14 2019
33 42.63 1992 44.35 1989 47.69 2020 30.97 1978
34 45.08 2003 44.77 2012 50.31 2021 31.56 2022
35 49.82 2008 46.60 1991 53.55 1993 32.50 2017
36 50.03 2022 46.85 1987 54.48 2014 32.67 2010
37 50.43 1985 49.67 2011 55.22 2018 34.04 2012
38 51.03 2004 50.06 1983 56.99 2017 34.67 2014
39 52.30 2013 53.30 2013 61.20 2013 35.81 2018
40 52.61 1982 53.63 1982 64.49 2016 41.90 2011
41 55.03 2018 55.80 2017 66.20 1977 43.92 2016
42 60.90 1973 56.94 2020 72.40 2009 44.50 1993
43 63.99 2020 57.49 1985 78.97 2023 55.10 1995
44 65.50 2011 58.04 2016 86.87 2012 56.39 2023
45 68.80 1977 59.00 2021 93.01 2011 66.17 1996
46 77.38 1972 62.84 1977 96.14 2010 66.37 1994
47 78.31 2012 64.05 2022 NaN 1970 92.03 1973
48 80.71 2023 81.49 2023 NaN 1971 NaN 1970
49 NaN 1970 84.46 1972 NaN 1972 NaN 1971
50 NaN 1971 NaN 1970 NaN 1973 NaN 1972
51 NaN 1974 NaN 1971 NaN 1974 NaN 1974
52 NaN 1975 NaN 1973 NaN 1975 NaN 1975
53 NaN 1976 NaN 1978 NaN 1976 NaN 1976

WQI Median By Year¶

First, we need to prepare the median dataframe.

In [ ]:
# create a global median for the region for a given year
g_median_df = grp_c_df.median(axis=1).reset_index().rename(columns={"WaterYear" : "Year", 0 : 'Global_median'})
g_median_df.index.name = 'index'
g_median_df.sort_values("Global_median", ascending=False )
Out[ ]:
Year Global_median
index
2 1972 87.380
53 2023 84.145
42 2012 72.720
41 2011 72.430
50 2020 70.790
52 2022 68.340
7 1977 66.260
47 2017 65.920
3 1973 64.585
46 2016 64.490
48 2018 63.860
15 1985 63.370
24 1994 62.950
51 2021 62.890
18 1988 62.310
43 2013 62.305
23 1993 62.170
35 2005 60.830
36 2006 60.460
12 1982 59.950
49 2019 59.110
40 2010 58.980
19 1989 58.650
39 2009 58.390
8 1978 58.300
33 2003 58.240
45 2015 57.860
34 2004 57.370
29 1999 56.900
17 1987 55.780
22 1992 55.680
28 1998 54.710
27 1997 54.660
44 2014 54.480
0 1970 54.410
26 1996 53.610
21 1991 53.500
20 1990 52.760
31 2001 51.370
13 1983 50.060
25 1995 49.840
38 2008 49.820
9 1979 48.340
1 1971 48.310
37 2007 47.540
11 1981 46.730
32 2002 46.400
30 2000 45.360
14 1984 43.990
16 1986 42.900
10 1980 41.400
5 1975 39.815
4 1974 39.420
6 1976 38.240

It might also be useful to get the number of stations reporting in a given year, although the count is not expected to vary significantly. For the majority of the examined period, 23 stations reported.

In [ ]:
stations_count_yearly_grp_c_df = pd.DataFrame(grp_c_df.count(axis=1), columns = ["StationsCount"])
stations_count_yearly_grp_c_df
Out[ ]:
StationsCount
WaterYear
1970 2
1971 3
1972 19
1973 12
1974 9
1975 10
1976 15
1977 20
1978 12
1979 23
1980 23
1981 23
1982 23
1983 23
1984 23
1985 23
1986 23
1987 23
1988 23
1989 23
1990 23
1991 23
1992 23
1993 23
1994 23
1995 23
1996 23
1997 23
1998 23
1999 23
2000 23
2001 23
2002 23
2003 23
2004 23
2005 23
2006 23
2007 23
2008 23
2009 23
2010 19
2011 19
2012 19
2013 22
2014 21
2015 23
2016 23
2017 23
2018 23
2019 23
2020 23
2021 23
2022 22
2023 22
In [ ]:
stations_count_yearly_grp_c_df.describe()
Out[ ]:
StationsCount
count 54.000000
mean 20.740741
std 4.983691
min 2.000000
25% 22.000000
50% 23.000000
75% 23.000000
max 23.000000
In [ ]:
""" Plots the median per a given year, also plots the number of stations per 
    given year.

    @param grp_df_ (in) The group of stations we want to consider
    @param stations_count_df_ (in) The dataframe with he stations count
    @param rec_count (in) the number of records in that group avail.
    @return a plot with yearly median for the region represented by grp_df_

"""
def plot_grp_median_horizontally(grp_df_, stations_count_df_, rec_count):
    
    # rename the year column so the stations-count frame can be
    # merged with the median frame on "Year"
    stations_count_df_ = stations_count_df_.reset_index().rename(columns={"WaterYear" : "Year"})
    
    df = grp_df_.merge(stations_count_df_)
    
    f = px.line(df, x="Year", y=["Global_median"], markers=True,
              title=f"The yearly WQI median  of all stations with at least "
              f"with at least {rec_count} records available "
              f"in King County, WA, and the number of stations for each year "              
              f"over years {YEAR_MIN}-{YEAR_MAX}.",
              height = 900,
              labels = { "value" : "WQI", "variable" : ""})
    f.update_layout(height = 900)
    f.update_xaxes(dtick=2)
    f.update_yaxes(dtick=10)

    f.add_annotation(x='1970', y=40, text="High concern")
    f.add_annotation(x='1970', y=80, text="Low concern")

    f.add_trace(go.Bar(x=df["Year"], y = df["StationsCount"],
                name="Stations Count"))
    f.update_yaxes(range = [-5, 100])
    
    return add_WQI_lines(f)    

#plot_grp_median_horizontally(g_median_df, stations_count_yearly_grp_c_df, C_GRP_COUNT).show()

If we look at the median for the entire region year by year, the quality of water remains within the Moderate concern band. Year 2023 is incomplete and as such should be excluded from the analysis or analyzed separately. Now that we know a little bit more about the entire history of the region, let's take a look at the last ten years, excluding 2023. For that, we need to prepare a different dataframe.
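
As a quick cross-check, the yearly medians can be binned into the same concern bands. A minimal sketch: the thresholds 40 and 80 are taken from the chart annotations, but the label strings are my own and may differ from the notebook's CONCERN_STRATA values.

```python
import pandas as pd

def classify_wqi(wqi):
    """Bin WQI values into concern bands: <40 high, [40, 80) moderate,
    >=80 low (thresholds taken from the chart annotations)."""
    bins = [0, 40, 80, 101]
    labels = ["High concern", "Moderate concern", "Low concern"]
    return pd.cut(wqi, bins=bins, labels=labels, right=False)

# three medians from the table above: best year, worst year, a recent year
medians = pd.Series({1972: 87.38, 1976: 38.24, 2012: 72.72})
print(classify_wqi(medians).tolist())  # ['Low concern', 'High concern', 'Moderate concern']
```

With `right=False` the bins are left-closed, so a median of exactly 40 falls into the moderate band and exactly 80 into the low-concern band, matching the annotation lines on the plots.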

In [ ]:
"""Plots a box plot with a line to connect medians across the year for the
region defined by grp_df_
    @param grp_df_ (in) The group of stations that we are interested in
    @param rec_count (in) The number of records guaranteed to be per
                          each station
    @param stations_count_df_ (in) The dataframe with the stations count
    @param period_ (in) string describing the years in the title
    @return fig The figure with medians across a given year with a line
                connecting medians for each year.
"""
def plot_yearly_medians(grp_df_, rec_count, stations_count_df_, period_):   
    # transpose the diagram
    df = grp_df_.transpose()
    median_series = df.median(axis = 0)

    stats_count_df = stations_count_df_.reset_index().rename(columns={"WaterYear" : "Year"})
    #print(median_series.to_list())
    #print(median_series.iloc[:,0])
    #print(median_series)
    #print(grp_df_.head())
    #print(df.head())
    fig =  px.box(df,  #points='all',
                  title=f"Yearly stats for King County, WA, including stations " 
                  f"with at least {rec_count}"
                  f" records for a period {period_}.",
                  labels = { "value" : "WQI", 
                            "variable" : "Stations IDs (Locators)"},
                  height = 900) 
    
    fig.add_hline(y=40, line_dash="dash", line_color="red")
    fig.add_annotation(x=0, y=40, text="High concern")
    fig.add_hline(y=80, line_dash="dash", line_color="green")
    fig.add_annotation(x=0, y=80, text="Low concern", ay=50)
    fig.update_xaxes(tickangle = 45)
    fig.update_layout(yaxis_title="WQI / Stations Count")
    fig.add_trace(go.Scatter(x = median_series.index, y = median_series.to_list(), 
                             name="Medians' line"))
    fig.add_trace(go.Bar(x=stats_count_df["Year"], y = stats_count_df["StationsCount"],
                name="Stations Count"))
    return fig

plot_yearly_medians(grp_c_df, C_GRP_COUNT, stations_count_yearly_grp_c_df, 
                    f"{YEAR_MIN}-{YEAR_MAX}").show()
In [ ]:
# the ten years' period that I want to examine
DELTA_YEARS = { 'start' : 2013, 'end' : 2022}

WQI for the Last 10 Years: 2013-2022¶

First, let's look at the dataframe. I removed 2023 as incomplete, along with all years prior to 2013.

In [ ]:
# the dataframe for the years I am interested in
df10 = df_an.loc[df_an.index.isin(range(DELTA_YEARS['start'], 1 + DELTA_YEARS['end']))]
print(df10.shape)
df10.info()
df10
(10, 78)
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2013 to 2022
Data columns (total 78 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   0450CC         10 non-null     float64
 1   0484A          2 non-null      float64
 2   3106           10 non-null     float64
 3   311            8 non-null      float64
 4   317            10 non-null     float64
 5   321            10 non-null     float64
 6   322            10 non-null     float64
 7   430            10 non-null     float64
 8   434            10 non-null     float64
 9   438            10 non-null     float64
 10  440            10 non-null     float64
 11  442            10 non-null     float64
 12  444            10 non-null     float64
 13  446            10 non-null     float64
 14  470            10 non-null     float64
 15  474            10 non-null     float64
 16  478            10 non-null     float64
 17  484            9 non-null      float64
 18  486            10 non-null     float64
 19  631            10 non-null     float64
 20  632            9 non-null      float64
 21  A315           10 non-null     float64
 22  A319           8 non-null      float64
 23  A320           10 non-null     float64
 24  A432           10 non-null     float64
 25  A438           8 non-null      float64
 26  A456           10 non-null     float64
 27  A499           8 non-null      float64
 28  A617           10 non-null     float64
 29  A620           10 non-null     float64
 30  A631           10 non-null     float64
 31  A670           8 non-null      float64
 32  A680           10 non-null     float64
 33  A685           10 non-null     float64
 34  A687           4 non-null      float64
 35  A690           8 non-null      float64
 36  AMES_1         10 non-null     float64
 37  B319           10 non-null     float64
 38  B484           10 non-null     float64
 39  B499           6 non-null      float64
 40  BB470          8 non-null      float64
 41  BSE_1MUDMTNRD  10 non-null     float64
 42  C320           10 non-null     float64
 43  C370           10 non-null     float64
 44  C446           0 non-null      float64
 45  C484           10 non-null     float64
 46  CHERRY_1       10 non-null     float64
 47  D320           10 non-null     float64
 48  D474           8 non-null      float64
 49  F321           10 non-null     float64
 50  G320           10 non-null     float64
 51  GRIFFIN        10 non-null     float64
 52  HARRIS_1       10 non-null     float64
 53  J484           2 non-null      float64
 54  KSHZ06         10 non-null     float64
 55  KTHA01         10 non-null     float64
 56  KTHA02         8 non-null      float64
 57  KTHA03         10 non-null     float64
 58  LSIN1          10 non-null     float64
 59  LSIN9          10 non-null     float64
 60  MFK_SNQ        10 non-null     float64
 61  N484           10 non-null     float64
 62  NFK_SNQ        10 non-null     float64
 63  PATTER_3       10 non-null     float64
 64  RAGING_MTH     10 non-null     float64
 65  S478           8 non-null      float64
 66  S484           8 non-null      float64
 67  SFK_SNQ        10 non-null     float64
 68  SKYKOMISH      10 non-null     float64
 69  SNQDUVALL      10 non-null     float64
 70  TOLT_MTH       10 non-null     float64
 71  VA12A          9 non-null      float64
 72  VA37A          9 non-null      float64
 73  VA41A          9 non-null      float64
 74  VA42A          10 non-null     float64
 75  VA45A          9 non-null      float64
 76  VA65A          10 non-null     float64
 77  X630           10 non-null     float64
dtypes: float64(78)
memory usage: 6.2 KB
Out[ ]:
0450CC 0484A 3106 311 317 321 322 430 434 438 440 442 444 446 470 474 478 484 486 631 632 A315 A319 A320 A432 A438 A456 A499 A617 A620 A631 A670 A680 A685 A687 A690 AMES_1 B319 B484 B499 BB470 BSE_1MUDMTNRD C320 C370 C446 C484 CHERRY_1 D320 D474 F321 G320 GRIFFIN HARRIS_1 J484 KSHZ06 KTHA01 KTHA02 KTHA03 LSIN1 LSIN9 MFK_SNQ N484 NFK_SNQ PATTER_3 RAGING_MTH S478 S484 SFK_SNQ SKYKOMISH SNQDUVALL TOLT_MTH VA12A VA37A VA41A VA42A VA45A VA65A X630
WaterYear
2013 39.22 NaN 63.99 NaN 26.63 86.67 52.30 63.62 53.30 82.60 80.50 77.87 36.54 61.18 45.94 25.88 69.49 59.66 46.11 73.94 63.41 61.20 NaN 87.70 75.60 NaN 50.96 NaN 76.87 69.29 80.26 NaN 67.15 91.89 NaN NaN 52.88 88.90 27.57 NaN NaN 59.33 78.24 52.09 NaN 63.53 86.85 90.55 NaN 93.28 65.76 85.29 91.59 NaN 52.88 54.33 NaN 83.13 76.68 99.93 77.62 69.23 74.85 66.55 81.25 NaN NaN 62.20 87.91 78.81 91.50 NaN NaN NaN 77.55 NaN 72.66 53.42
2014 40.20 NaN 66.95 NaN 34.67 56.92 42.53 41.95 38.82 83.19 64.23 47.38 21.33 56.70 40.08 27.83 54.81 41.31 44.51 53.75 NaN 54.48 NaN 82.39 53.95 NaN 35.01 NaN 55.55 71.47 52.59 NaN 54.71 76.32 NaN NaN 46.09 84.33 35.54 NaN NaN 75.28 69.95 39.42 NaN 44.48 87.21 84.28 NaN 91.13 59.50 85.18 82.95 NaN 44.29 54.51 NaN 61.60 86.65 99.37 81.20 62.31 85.19 61.82 67.26 NaN NaN 66.86 87.73 88.20 85.16 89.46 77.17 60.44 49.14 30.41 70.40 41.84
2015 35.69 NaN 62.42 62.91 26.69 62.54 39.87 55.76 33.16 78.78 76.42 70.58 41.08 61.86 57.45 34.86 45.59 48.61 57.86 65.66 44.39 32.50 81.03 84.30 44.40 83.06 35.49 53.69 69.94 69.77 76.10 72.49 72.48 89.12 NaN 81.36 43.03 84.78 32.04 NaN 64.15 71.82 75.59 48.77 NaN 46.97 83.52 83.87 46.83 95.05 60.68 86.93 88.20 74.45 48.49 76.30 39.59 63.21 11.25 83.47 67.61 64.06 82.79 69.68 64.93 69.52 37.72 60.61 84.26 82.32 79.20 74.61 72.76 66.32 58.33 33.95 65.70 39.65
2016 41.45 NaN 58.55 56.27 43.92 62.32 38.58 67.18 58.04 66.29 51.59 57.33 44.48 57.24 67.98 40.83 68.68 60.42 62.45 69.11 70.30 64.49 60.31 88.81 71.11 90.54 29.52 55.72 53.50 63.43 71.81 75.00 77.57 77.44 NaN 81.62 44.29 78.88 49.31 NaN 62.31 51.06 90.49 49.94 NaN 65.86 91.11 86.31 68.32 69.50 70.92 91.47 90.21 80.92 36.44 72.16 28.15 72.96 30.99 87.16 80.06 77.89 86.43 53.48 89.79 71.72 41.02 77.65 89.41 54.18 89.02 50.54 56.33 42.18 52.45 35.15 38.81 46.38
2017 55.91 NaN 72.43 70.98 32.50 60.13 23.69 59.61 55.80 87.90 61.54 63.31 58.70 65.92 69.09 45.72 57.44 65.70 57.45 66.96 66.02 56.99 79.89 76.91 62.43 87.79 66.75 83.99 74.31 63.58 70.91 78.77 73.66 85.07 NaN 78.24 61.31 84.69 52.20 75.10 74.16 70.07 86.61 49.92 NaN 66.18 90.45 91.62 65.22 94.51 75.22 92.46 92.96 NaN 44.93 81.92 47.96 57.36 22.72 91.96 83.81 76.95 86.28 66.57 79.96 65.48 47.36 89.38 92.05 77.50 86.98 63.63 53.97 59.29 37.49 44.69 58.77 46.96
2018 65.77 NaN 66.96 65.76 35.81 56.57 55.03 63.86 41.01 83.55 75.81 82.72 55.80 66.66 57.48 41.66 61.01 72.83 63.59 65.34 37.08 55.22 86.64 85.58 69.32 90.58 62.35 82.97 63.51 63.57 77.02 47.63 79.17 84.95 NaN 72.42 60.09 90.77 48.27 62.88 74.22 72.38 78.97 43.72 NaN 76.73 81.73 88.56 62.01 95.76 75.05 89.17 87.79 NaN 63.44 62.63 63.31 73.95 18.42 85.47 78.11 80.05 82.36 56.10 78.62 84.57 42.41 83.57 92.81 69.00 87.73 78.33 78.85 70.18 70.18 24.11 72.29 47.06
2019 32.79 NaN 68.22 68.28 27.14 57.94 34.45 44.58 44.29 85.04 66.73 80.48 42.50 60.01 53.55 47.51 59.96 55.95 58.49 59.11 57.12 47.12 75.70 83.25 53.73 85.55 34.82 75.86 64.82 62.00 63.04 80.33 78.24 90.22 72.67 85.45 76.14 87.38 40.03 49.72 72.64 70.02 77.73 56.23 NaN 58.63 90.29 85.14 48.17 95.72 66.13 92.85 92.49 NaN 65.34 66.78 66.21 81.09 28.10 84.88 84.68 66.95 91.10 77.12 80.16 72.53 32.57 83.36 89.99 83.94 90.98 81.47 69.81 76.89 74.91 34.70 73.79 54.34
2020 62.50 NaN 72.85 75.55 25.09 62.46 63.99 69.91 56.94 87.70 81.81 86.24 54.35 67.64 69.74 47.49 81.21 81.00 70.76 79.11 70.79 47.69 79.32 84.73 67.70 87.20 67.86 82.47 85.30 80.76 77.06 77.11 82.31 87.09 81.00 78.23 53.86 87.42 53.80 67.06 87.20 74.76 83.60 52.60 NaN 84.58 81.07 89.10 70.43 95.44 79.61 86.00 80.94 NaN 49.97 56.44 59.53 80.00 18.62 91.60 67.49 85.02 79.68 63.13 58.93 83.63 37.92 57.46 92.90 64.77 75.65 86.31 69.56 78.02 76.12 35.61 77.56 72.79
2021 43.73 72.12 66.44 62.89 26.65 60.60 38.80 68.43 59.00 72.80 67.06 56.78 41.14 61.53 61.24 48.14 67.38 71.62 47.22 72.85 49.49 50.31 71.23 71.27 66.62 83.47 65.00 78.63 58.52 49.28 73.87 62.38 67.01 75.32 67.48 86.11 51.32 83.81 27.80 66.01 76.17 76.31 78.19 48.91 NaN 66.95 79.36 77.65 66.21 91.92 67.12 84.17 91.02 NaN 55.71 38.36 30.38 64.13 7.13 91.20 73.77 72.56 85.72 69.06 63.23 82.41 35.08 65.47 88.69 75.77 81.06 67.88 29.74 33.14 59.38 19.78 52.87 42.49
2022 48.03 69.78 72.72 70.42 31.56 54.33 50.03 78.68 64.05 87.05 79.95 69.99 41.24 71.33 64.29 52.62 66.26 NaN 54.30 76.68 33.41 44.78 76.63 75.39 75.22 90.72 63.93 84.70 67.23 71.99 80.58 69.16 76.84 73.12 79.96 84.24 52.75 93.22 51.97 67.51 77.87 55.98 79.39 60.28 NaN 73.67 87.18 88.20 63.49 95.60 70.42 92.50 84.69 NaN 64.40 57.99 60.66 77.25 18.52 90.99 84.86 74.37 86.65 64.31 87.36 81.90 38.37 80.05 94.14 87.25 94.85 80.86 68.57 64.91 75.73 52.90 55.20 52.20

Below, the stations are listed in decreasing order of the number of records available for the given time period.
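
The helper get_and_sort_freqs comes from myutils and is not shown in this document; a minimal sketch of what it presumably does (an assumption on my part): count the non-null records per station column and sort in descending order.

```python
import pandas as pd

def get_and_sort_freqs(df):
    """Count non-null records per column (station) and sort descending.
    Hypothetical reimplementation of the myutils helper."""
    return df.count().sort_values(ascending=False)

# toy frame: station "C" has 3 records, "A" has 2, "B" has 1
toy = pd.DataFrame({"A": [1.0, 2.0, None],
                    "B": [1.0, None, None],
                    "C": [1.0, 2.0, 3.0]})
print(get_and_sort_freqs(toy).tolist())  # [3, 2, 1]
```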

In [ ]:
# frequencies of the number of records per station for the given 10-year
# period; this is a series
series10_rec_freq = get_and_sort_freqs(df10)
series10_rec_freq
Out[ ]:
0450CC           10
A680             10
AMES_1           10
B319             10
B484             10
BSE_1MUDMTNRD    10
C320             10
C370             10
C484             10
CHERRY_1         10
D320             10
F321             10
G320             10
GRIFFIN          10
HARRIS_1         10
KSHZ06           10
KTHA01           10
KTHA03           10
LSIN1            10
LSIN9            10
MFK_SNQ          10
N484             10
NFK_SNQ          10
PATTER_3         10
RAGING_MTH       10
SFK_SNQ          10
SKYKOMISH        10
SNQDUVALL        10
TOLT_MTH         10
VA42A            10
VA65A            10
A685             10
X630             10
478              10
317              10
438              10
440              10
442              10
444              10
446              10
470              10
474              10
430              10
322              10
486              10
631              10
321              10
A315             10
A320             10
A456             10
3106             10
A631             10
A620             10
A617             10
434              10
A432             10
484               9
VA41A             9
VA45A             9
VA37A             9
VA12A             9
632               9
S478              8
311               8
S484              8
A670              8
KTHA02            8
D474              8
A319              8
A438              8
A499              8
BB470             8
A690              8
B499              6
A687              4
J484              2
0484A             2
C446              0
dtype: int64
In [ ]:
# show the histogram of the record availability frequencies
plot_freqs(series10_rec_freq, 
           f"Frequency of available records for timeframe "
           f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']} "
           f"available at a locator.").show()

Let's see where the stations with 8, 9, and 10 available records are located. This will help decide whether we should focus only on locators with all 10 records available.

In [ ]:
# remove the stations with fewer than 8 available records
series10_rec_freq = series10_rec_freq[series10_rec_freq >= 8]
In [ ]:
""" Change the series to the dataframe 
    @param s_ (in): The series to be changed
    @param col_name_ (in) : the name what the series represent
    @param reset_idx_ (in): True the index will be reset, otherwise not
    @param idx_name_ (in) : how the index will be named, only matters if 
                            reset_idx is set to True
    
    @return df : the dataframe corresponding to a s_ with index as a first 
                 column and series as the second column 
"""
def series2df(s_, col_name_, idx_name_ = 'index', reset_idx_ = True  ):
    
    if (reset_idx_):
        # so your index starts with 0, 1, 2, 
        df = s_.to_frame(name=col_name_ ).reset_index()
        df = df.rename(columns= {'index' : idx_name_})
        # otherwise the name of the index is None
        df.index.name = 'index'
    else:
        # just make it a frame
        df = s_.to_frame(name=col_name_)

    return df

""" Prepares the record frequency series for drawing and studying
    @param s_ (in) series to be changed, i.e., records count per station
    @param loc_df_ (in) The dataframe with station locations that will
                        be merged to the resultant location

    @return df The combined dataframe that has stations' data and records
               counting
    
"""
def prepare_rec_freq_series(s_, loc_df_):
    # convert the series to a dataframe with proper column names
    # the name of the default column
    COL1 = 'Rec_count'
    rec_count_df = series2df(s_, COL1, 'Locator')
    # stratify the dataframe by the Rec_count column
    bins = [0, 8, 9, 10]
    labels = ['8', '9', '10'] 
    rec_count_df['Rec_count_lbl'] = pd.cut(rec_count_df[COL1], bins = bins, labels=labels)
    
    #print(rec_count_df)

    # merge the dataframe with wqi values into a dataframe having
    # the location, stations names and stations' ids so we can 
    # nicely plot the clean locations
    # print(loc_df_.columns)
    df = rec_count_df.merge(loc_df_[['Locator', 'SiteName', 'lng', 'lat']], on = 'Locator')    
    #print(df)

    return df

# 10 last years records count per stations
rec_freq_10_df = prepare_rec_freq_series(series10_rec_freq, loc_df)
In [ ]:
"""
    Prepares the figure to figure out where the stations with 
    a given number of records are

    @param rec_freq_df_ (in) main dataframe to be presented
    @param title_ (in) The title of the figure

    @return figure
"""
def plot_density_scatter(df_, title_=""):
    
    #print(df_)

    f = px.scatter_mapbox(df_, lat = "lat", lon = "lng", 
                          center = dict (lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG+0.2), zoom = 8, 
                          hover_name = "SiteName", hover_data = ["Locator", "Rec_count"],
                          mapbox_style = MAP_STYLE, title = title_,
                          size = 'Rec_count',
                          size_max = 20,
                          color = 'Rec_count_lbl',                          
                          #labels = { "legend" : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"} },
                          #legend = {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"},
                          color_discrete_map = {
                              '8' : "red",
                              '9'  : "yellow",
                              '10'  : "blue"
                              
                          },
                          category_orders = {
                              'Rec_count_lbl' : [ '10', '9', '8']
                          },                          
                         width = 800, height = 900)
    #f.update_layout(legend={ labels : {"-1" : "<40", "0" : "[40; 80)", "1" : ">=80"}})
    f.update_layout(legend_title_text="Record count")
    
    return f

plot_density_scatter(rec_freq_10_df,
                        "Stations in King County, WA, grouped by the number of " 
                        f"records available for the period "
                        f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}.").show()

The coverage of the region by stations holding 10 records is pretty comprehensive. The stations with 9 records are mostly on Vashon Island (4 stations); the two others are located in Redmond and Issaquah. The 11 stations holding 8 records overlap to some extent with the area covered by the stations holding 10 records. There are three northern locations not covered by any other stations (North Creek), one in Bellevue (Cochran Springs), and one in the East Renton area (Cedar River).

As we can see below, the missing years are mostly 2013 and 2014; the exception is station 484, for which only the 2022 record is unavailable. So, in order to have a longer timeframe and decent region coverage, I will use only stations that offer 10 records for the last 10 years. Another option would be to exclude 2013 and 2014 from the analysis and focus on the period 2015-2022.
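
The get_missing_years helper is also defined in myutils; a minimal sketch under the assumption that it lists, for each requested locator, the index values (water years) where WQI is NaN:

```python
import pandas as pd

def get_missing_years(dat_fr, loc_list):
    """For each locator, list the years (index values) with a missing WQI.
    Hypothetical reimplementation of the myutils helper."""
    return [[loc, dat_fr.index[dat_fr[loc].isna()].tolist()]
            for loc in loc_list]

# toy frame: station "484" misses 2022, station "632" misses 2021
toy = pd.DataFrame({"484": [50.0, None], "632": [None, 60.0]},
                   index=pd.Index([2021, 2022], name="WaterYear"))
print(get_missing_years(toy, ["484", "632"]))  # [['484', [2022]], ['632', [2021]]]
```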

In [ ]:
# check which years are missing
locs = rec_freq_10_df[rec_freq_10_df['Rec_count'] < 10]['Locator']
get_missing_years(dat_fr=df10,  loc_list = locs.to_list())
Out[ ]:
[['484', [2022]],
 ['VA41A', [2013]],
 ['VA45A', [2013]],
 ['VA37A', [2013]],
 ['VA12A', [2013]],
 ['632', [2014]],
 ['S478', [2013, 2014]],
 ['311', [2013, 2014]],
 ['S484', [2013, 2014]],
 ['A670', [2013, 2014]],
 ['KTHA02', [2013, 2014]],
 ['D474', [2013, 2014]],
 ['A319', [2013, 2014]],
 ['A438', [2013, 2014]],
 ['A499', [2013, 2014]],
 ['BB470', [2013, 2014]],
 ['A690', [2013, 2014]]]
In [ ]:
# that is a dataframe with all 10 records available
only10_df = df10.loc[:, df10.columns.isin (series10_rec_freq[series10_rec_freq == 10].index)]
print(only10_df)
print(only10_df.info())
print(only10_df.describe())
           0450CC   3106    317    321    322    430    434    438    440  \
WaterYear                                                                   
2013        39.22  63.99  26.63  86.67  52.30  63.62  53.30  82.60  80.50   
2014        40.20  66.95  34.67  56.92  42.53  41.95  38.82  83.19  64.23   
2015        35.69  62.42  26.69  62.54  39.87  55.76  33.16  78.78  76.42   
2016        41.45  58.55  43.92  62.32  38.58  67.18  58.04  66.29  51.59   
2017        55.91  72.43  32.50  60.13  23.69  59.61  55.80  87.90  61.54   
2018        65.77  66.96  35.81  56.57  55.03  63.86  41.01  83.55  75.81   
2019        32.79  68.22  27.14  57.94  34.45  44.58  44.29  85.04  66.73   
2020        62.50  72.85  25.09  62.46  63.99  69.91  56.94  87.70  81.81   
2021        43.73  66.44  26.65  60.60  38.80  68.43  59.00  72.80  67.06   
2022        48.03  72.72  31.56  54.33  50.03  78.68  64.05  87.05  79.95   

             442    444    446    470    474    478    486    631   A315  \
WaterYear                                                                  
2013       77.87  36.54  61.18  45.94  25.88  69.49  46.11  73.94  61.20   
2014       47.38  21.33  56.70  40.08  27.83  54.81  44.51  53.75  54.48   
2015       70.58  41.08  61.86  57.45  34.86  45.59  57.86  65.66  32.50   
2016       57.33  44.48  57.24  67.98  40.83  68.68  62.45  69.11  64.49   
2017       63.31  58.70  65.92  69.09  45.72  57.44  57.45  66.96  56.99   
2018       82.72  55.80  66.66  57.48  41.66  61.01  63.59  65.34  55.22   
2019       80.48  42.50  60.01  53.55  47.51  59.96  58.49  59.11  47.12   
2020       86.24  54.35  67.64  69.74  47.49  81.21  70.76  79.11  47.69   
2021       56.78  41.14  61.53  61.24  48.14  67.38  47.22  72.85  50.31   
2022       69.99  41.24  71.33  64.29  52.62  66.26  54.30  76.68  44.78   

            A320   A432   A456   A617   A620   A631   A680   A685  AMES_1  \
WaterYear                                                                   
2013       87.70  75.60  50.96  76.87  69.29  80.26  67.15  91.89   52.88   
2014       82.39  53.95  35.01  55.55  71.47  52.59  54.71  76.32   46.09   
2015       84.30  44.40  35.49  69.94  69.77  76.10  72.48  89.12   43.03   
2016       88.81  71.11  29.52  53.50  63.43  71.81  77.57  77.44   44.29   
2017       76.91  62.43  66.75  74.31  63.58  70.91  73.66  85.07   61.31   
2018       85.58  69.32  62.35  63.51  63.57  77.02  79.17  84.95   60.09   
2019       83.25  53.73  34.82  64.82  62.00  63.04  78.24  90.22   76.14   
2020       84.73  67.70  67.86  85.30  80.76  77.06  82.31  87.09   53.86   
2021       71.27  66.62  65.00  58.52  49.28  73.87  67.01  75.32   51.32   
2022       75.39  75.22  63.93  67.23  71.99  80.58  76.84  73.12   52.75   

            B319   B484  BSE_1MUDMTNRD   C320   C370   C484  CHERRY_1   D320  \
WaterYear                                                                      
2013       88.90  27.57          59.33  78.24  52.09  63.53     86.85  90.55   
2014       84.33  35.54          75.28  69.95  39.42  44.48     87.21  84.28   
2015       84.78  32.04          71.82  75.59  48.77  46.97     83.52  83.87   
2016       78.88  49.31          51.06  90.49  49.94  65.86     91.11  86.31   
2017       84.69  52.20          70.07  86.61  49.92  66.18     90.45  91.62   
2018       90.77  48.27          72.38  78.97  43.72  76.73     81.73  88.56   
2019       87.38  40.03          70.02  77.73  56.23  58.63     90.29  85.14   
2020       87.42  53.80          74.76  83.60  52.60  84.58     81.07  89.10   
2021       83.81  27.80          76.31  78.19  48.91  66.95     79.36  77.65   
2022       93.22  51.97          55.98  79.39  60.28  73.67     87.18  88.20   

            F321   G320  GRIFFIN  HARRIS_1  KSHZ06  KTHA01  KTHA03  LSIN1  \
WaterYear                                                                   
2013       93.28  65.76    85.29     91.59   52.88   54.33   83.13  76.68   
2014       91.13  59.50    85.18     82.95   44.29   54.51   61.60  86.65   
2015       95.05  60.68    86.93     88.20   48.49   76.30   63.21  11.25   
2016       69.50  70.92    91.47     90.21   36.44   72.16   72.96  30.99   
2017       94.51  75.22    92.46     92.96   44.93   81.92   57.36  22.72   
2018       95.76  75.05    89.17     87.79   63.44   62.63   73.95  18.42   
2019       95.72  66.13    92.85     92.49   65.34   66.78   81.09  28.10   
2020       95.44  79.61    86.00     80.94   49.97   56.44   80.00  18.62   
2021       91.92  67.12    84.17     91.02   55.71   38.36   64.13   7.13   
2022       95.60  70.42    92.50     84.69   64.40   57.99   77.25  18.52   

           LSIN9  MFK_SNQ   N484  NFK_SNQ  PATTER_3  RAGING_MTH  SFK_SNQ  \
WaterYear                                                                  
2013       99.93    77.62  69.23    74.85     66.55       81.25    62.20   
2014       99.37    81.20  62.31    85.19     61.82       67.26    66.86   
2015       83.47    67.61  64.06    82.79     69.68       64.93    60.61   
2016       87.16    80.06  77.89    86.43     53.48       89.79    77.65   
2017       91.96    83.81  76.95    86.28     66.57       79.96    89.38   
2018       85.47    78.11  80.05    82.36     56.10       78.62    83.57   
2019       84.88    84.68  66.95    91.10     77.12       80.16    83.36   
2020       91.60    67.49  85.02    79.68     63.13       58.93    57.46   
2021       91.20    73.77  72.56    85.72     69.06       63.23    65.47   
2022       90.99    84.86  74.37    86.65     64.31       87.36    80.05   

           SKYKOMISH  SNQDUVALL  TOLT_MTH  VA42A  VA65A   X630  
WaterYear                                                       
2013           87.91      78.81     91.50  77.55  72.66  53.42  
2014           87.73      88.20     85.16  49.14  70.40  41.84  
2015           84.26      82.32     79.20  58.33  65.70  39.65  
2016           89.41      54.18     89.02  52.45  38.81  46.38  
2017           92.05      77.50     86.98  37.49  58.77  46.96  
2018           92.81      69.00     87.73  70.18  72.29  47.06  
2019           89.99      83.94     90.98  74.91  73.79  54.34  
2020           92.90      64.77     75.65  76.12  77.56  72.79  
2021           88.69      75.77     81.06  59.38  52.87  42.49  
2022           94.14      87.25     94.85  75.73  55.20  52.20  
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 2013 to 2022
Data columns (total 56 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   0450CC         10 non-null     float64
 1   3106           10 non-null     float64
 2   317            10 non-null     float64
 3   321            10 non-null     float64
 4   322            10 non-null     float64
 5   430            10 non-null     float64
 6   434            10 non-null     float64
 7   438            10 non-null     float64
 8   440            10 non-null     float64
 9   442            10 non-null     float64
 10  444            10 non-null     float64
 11  446            10 non-null     float64
 12  470            10 non-null     float64
 13  474            10 non-null     float64
 14  478            10 non-null     float64
 15  486            10 non-null     float64
 16  631            10 non-null     float64
 17  A315           10 non-null     float64
 18  A320           10 non-null     float64
 19  A432           10 non-null     float64
 20  A456           10 non-null     float64
 21  A617           10 non-null     float64
 22  A620           10 non-null     float64
 23  A631           10 non-null     float64
 24  A680           10 non-null     float64
 25  A685           10 non-null     float64
 26  AMES_1         10 non-null     float64
 27  B319           10 non-null     float64
 28  B484           10 non-null     float64
 29  BSE_1MUDMTNRD  10 non-null     float64
 30  C320           10 non-null     float64
 31  C370           10 non-null     float64
 32  C484           10 non-null     float64
 33  CHERRY_1       10 non-null     float64
 34  D320           10 non-null     float64
 35  F321           10 non-null     float64
 36  G320           10 non-null     float64
 37  GRIFFIN        10 non-null     float64
 38  HARRIS_1       10 non-null     float64
 39  KSHZ06         10 non-null     float64
 40  KTHA01         10 non-null     float64
 41  KTHA03         10 non-null     float64
 42  LSIN1          10 non-null     float64
 43  LSIN9          10 non-null     float64
 44  MFK_SNQ        10 non-null     float64
 45  N484           10 non-null     float64
 46  NFK_SNQ        10 non-null     float64
 47  PATTER_3       10 non-null     float64
 48  RAGING_MTH     10 non-null     float64
 49  SFK_SNQ        10 non-null     float64
 50  SKYKOMISH      10 non-null     float64
 51  SNQDUVALL      10 non-null     float64
 52  TOLT_MTH       10 non-null     float64
 53  VA42A          10 non-null     float64
 54  VA65A          10 non-null     float64
 55  X630           10 non-null     float64
dtypes: float64(56)
memory usage: 4.5 KB
None
          0450CC       3106        317        321        322        430  \
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000   
mean   46.529000  67.153000  31.066000  62.048000  43.927000  61.358000   
std    11.298463   4.707403   5.892317   9.099769  11.574836  11.358006   
min    32.790000  58.550000  25.090000  54.330000  23.690000  41.950000   
25%    39.465000  64.602500  26.660000  57.175000  38.635000  56.722500   
50%    42.590000  66.955000  29.350000  60.365000  41.200000  63.740000   
75%    53.940000  71.377500  34.127500  62.425000  51.732500  68.117500   
max    65.770000  72.850000  43.920000  86.670000  63.990000  78.680000   

             434        438        440        442       444        446  \
count  10.000000  10.000000  10.000000  10.000000  10.00000  10.000000   
mean   50.441000  81.490000  70.564000  69.268000  43.71600  63.007000   
std    10.305312   7.035104   9.908746  12.840904  10.84246   4.730758   
min    33.160000  66.290000  51.590000  47.380000  21.33000  56.700000   
25%    41.830000  79.735000  64.855000  58.825000  41.09500  60.302500   
50%    54.550000  83.370000  71.435000  70.285000  41.87000  61.695000   
75%    57.765000  86.547500  79.067500  79.827500  51.88250  66.475000   
max    64.050000  87.900000  81.810000  86.240000  58.70000  71.330000   

             470        474        478       486        631       A315  \
count  10.000000  10.000000  10.000000  10.00000  10.000000  10.000000   
mean   58.684000  41.254000  63.183000  56.27400  68.251000  51.478000   
std     9.959611   9.020747   9.716899   8.41883   7.845098   9.170964   
min    40.080000  25.880000  45.590000  44.51000  53.750000  32.500000   
25%    54.525000  36.352500  58.070000  48.99000  65.420000  47.262500   
50%    59.360000  43.690000  63.635000  57.65500  68.035000  52.395000   
75%    67.057500  47.505000  68.355000  61.46000  73.667500  56.547500   
max    69.740000  52.620000  81.210000  70.76000  79.110000  64.490000   

            A320     A432       A456       A617       A620       A631  \
count  10.000000  10.0000  10.000000  10.000000  10.000000  10.000000   
mean   82.033000  64.0080  51.169000  66.955000  66.514000  72.324000   
std     5.684894  10.2900  15.788422   9.981175   8.299787   8.629143   
min    71.270000  44.4000  29.520000  53.500000  49.280000  52.590000   
25%    78.280000  56.0700  35.130000  59.767500  63.465000  71.135000   
50%    83.775000  67.1600  56.655000  66.025000  66.435000  74.985000   
75%    85.367500  70.6625  64.732500  73.217500  71.045000  77.050000   
max    88.810000  75.6000  67.860000  85.300000  80.760000  80.580000   

            A680       A685     AMES_1       B319       B484  BSE_1MUDMTNRD  \
count  10.000000  10.000000  10.000000  10.000000  10.000000      10.000000   
mean   72.914000  83.054000  54.176000  86.418000  41.853000      67.701000   
std     8.117826   6.876159   9.818384   4.040747  10.487521       8.916982   
min    54.710000  73.120000  43.030000  78.880000  27.570000      51.060000   
25%    68.482500  76.600000  47.397500  84.420000  32.915000      62.002500   
50%    75.250000  85.010000  52.815000  86.080000  44.150000      70.945000   
75%    78.072500  88.612500  58.532500  88.530000  51.305000      74.165000   
max    82.310000  91.890000  76.140000  93.220000  53.800000      76.310000   

            C320       C370       C484   CHERRY_1       D320       F321  \
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000   
mean   79.876000  50.188000  64.758000  85.877000  86.528000  91.791000   
std     5.773008   5.862799  12.429249   4.223272   4.062787   8.004076   
min    69.950000  39.420000  44.480000  79.360000  77.650000  69.500000   
25%    77.845000  48.805000  59.855000  82.177500  84.495000  92.260000   
50%    78.605000  49.930000  66.020000  87.015000  87.255000  94.780000   
75%    82.547500  52.472500  71.990000  89.520000  88.965000  95.560000   
max    90.490000  60.280000  84.580000  91.110000  91.620000  95.760000   

           G320    GRIFFIN   HARRIS_1     KSHZ06     KTHA01     KTHA03  \
count  10.00000  10.000000  10.000000  10.000000  10.000000  10.000000   
mean   69.04100  88.602000  88.284000  52.589000  62.142000  71.468000   
std     6.45747   3.473074   4.179767   9.670356  12.685679   9.199348   
min    59.50000  84.170000  80.940000  36.440000  38.360000  57.360000   
25%    65.85250  85.467500  85.465000  45.820000  54.992500  63.440000   
50%    68.77000  88.050000  89.205000  51.425000  60.310000  73.455000   
75%    74.01750  92.212500  91.447500  61.507500  70.815000  79.312500   
max    79.61000  92.850000  92.960000  65.340000  81.920000  83.130000   

           LSIN1      LSIN9    MFK_SNQ       N484    NFK_SNQ   PATTER_3  \
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000   
mean   31.908000  90.603000  77.921000  72.939000  84.105000  64.782000   
std    27.242572   5.666953   6.466331   7.308639   4.463913   6.796064   
min     7.130000  83.470000  67.490000  62.310000  74.850000  53.480000   
25%    18.445000  85.892500  74.732500  67.520000  82.467500  62.147500   
50%    20.670000  91.095000  79.085000  73.465000  85.455000  65.430000   
75%    30.267500  91.870000  83.157500  77.655000  86.392500  68.437500   
max    86.650000  99.930000  84.860000  85.020000  91.100000  77.120000   

       RAGING_MTH    SFK_SNQ  SKYKOMISH  SNQDUVALL  TOLT_MTH      VA42A  \
count   10.000000  10.000000  10.000000  10.000000   10.0000  10.000000   
mean    75.149000  72.661000  89.989000  76.174000   86.2130  63.128000   
std     10.712278  11.370424   3.021771  10.747145    6.0043  13.857216   
min     58.930000  57.460000  84.260000  54.180000   75.6500  37.490000   
25%     65.512500  63.017500  88.105000  70.692500   82.0850  53.920000   
50%     79.290000  72.255000  89.700000  78.155000   87.3550  64.780000   
75%     80.977500  82.532500  92.620000  83.535000   90.4900  75.525000   
max     89.790000  89.380000  94.140000  88.200000   94.8500  77.550000   

           VA65A       X630  
count  10.000000  10.000000  
mean   63.805000  49.713000  
std    12.156905   9.516245  
min    38.810000  39.650000  
25%    56.092500  43.462500  
50%    68.050000  47.010000  
75%    72.567500  53.115000  
max    77.560000  72.790000  

Analysis¶

Let's look at the yearly median for the whole region.

Yearly Median¶

We can see that the yearly WQI median, apart from 58.21 in 2014, is greater than 64. In 2020, the yearly WQI median reached its highest value so far, 75.205. In general, the region is closer to Low concern water quality than to High concern.

In [ ]:
# this is how many records per station is available
ONLY_DF10_REC_COUNT = 10
In [ ]:
plot_yearly_medians(grp_df_ = only10_df, rec_count=ONLY_DF10_REC_COUNT, 
    stations_count_df_= pd.DataFrame(only10_df.count(axis=1), columns = ["StationsCount"]), 
    period_=f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}; {len(only10_df.columns)} stations reporting").show()
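As a cross-check, the yearly medians plotted above can be reproduced directly with pandas. A minimal sketch, assuming a wide frame shaped like `only10_df` (WaterYear rows, one column per station); the frame below is illustrative stand-in data, not the real dataset:

```python
import pandas as pd

# Illustrative stand-in for only10_df: WaterYear index, one column per station.
wqi = pd.DataFrame(
    {"A320": [87.70, 82.39, 84.30],
     "B319": [88.90, 84.33, 84.78],
     "LSIN1": [76.68, 86.65, 11.25]},
    index=pd.Index([2013, 2014, 2015], name="WaterYear"),
)

# Median WQI across all reporting stations, per water year.
yearly_median = wqi.median(axis=1)

# Number of stations reporting each year (the StationsCount overlay above
# is built the same way from only10_df).
stations_count = wqi.count(axis=1).to_frame("StationsCount")

print(yearly_median)
```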

Now let's look at individual stations.

Individual Stations' WQI¶

Here is a summary of basic statistics for the group, sorted in descending order by median.

In [ ]:
# get only10 stats
only10_stats_df = crt_stat_df(only10_df).reset_index()
only10_stats_df = only10_stats_df.rename(columns= {'index' : 'Locator'})
only10_stats_df.index.name = 'index'
only10_stats_df
Out[ ]:
Locator min max median
index
0 F321 69.50 95.76 94.780
1 LSIN9 83.47 99.93 91.095
2 SKYKOMISH 84.26 94.14 89.700
3 HARRIS_1 80.94 92.96 89.205
4 GRIFFIN 84.17 92.85 88.050
5 TOLT_MTH 75.65 94.85 87.355
6 D320 77.65 91.62 87.255
7 CHERRY_1 79.36 91.11 87.015
8 B319 78.88 93.22 86.080
9 NFK_SNQ 74.85 91.10 85.455
10 A685 73.12 91.89 85.010
11 A320 71.27 88.81 83.775
12 438 66.29 87.90 83.370
13 RAGING_MTH 58.93 89.79 79.290
14 MFK_SNQ 67.49 84.86 79.085
15 C320 69.95 90.49 78.605
16 SNQDUVALL 54.18 88.20 78.155
17 A680 54.71 82.31 75.250
18 A631 52.59 80.58 74.985
19 N484 62.31 85.02 73.465
20 KTHA03 57.36 83.13 73.455
21 SFK_SNQ 57.46 89.38 72.255
22 440 51.59 81.81 71.435
23 BSE_1MUDMTNRD 51.06 76.31 70.945
24 442 47.38 86.24 70.285
25 G320 59.50 79.61 68.770
26 VA65A 38.81 77.56 68.050
27 631 53.75 79.11 68.035
28 A432 44.40 75.60 67.160
29 3106 58.55 72.85 66.955
30 A620 49.28 80.76 66.435
31 A617 53.50 85.30 66.025
32 C484 44.48 84.58 66.020
33 PATTER_3 53.48 77.12 65.430
34 VA42A 37.49 77.55 64.780
35 430 41.95 78.68 63.740
36 478 45.59 81.21 63.635
37 446 56.70 71.33 61.695
38 321 54.33 86.67 60.365
39 KTHA01 38.36 81.92 60.310
40 470 40.08 69.74 59.360
41 486 44.51 70.76 57.655
42 A456 29.52 67.86 56.655
43 434 33.16 64.05 54.550
44 AMES_1 43.03 76.14 52.815
45 A315 32.50 64.49 52.395
46 KSHZ06 36.44 65.34 51.425
47 C370 39.42 60.28 49.930
48 X630 39.65 72.79 47.010
49 B484 27.57 53.80 44.150
50 474 25.88 52.62 43.690
51 0450CC 32.79 65.77 42.590
52 444 21.33 58.70 41.870
53 322 23.69 63.99 41.200
54 317 25.09 43.92 29.350
55 LSIN1 7.13 86.65 20.670
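`crt_stat_df` is defined elsewhere in the notebook; a plausible equivalent of the min/max/median summary above (the function name `crt_stat_df_sketch` and the demo frame are ours) is:

```python
import pandas as pd

def crt_stat_df_sketch(df_: pd.DataFrame) -> pd.DataFrame:
    """Per-station min/max/median, sorted in descending order by median.

    Assumes df_ is wide: one row per WaterYear, one column per station.
    """
    stats = df_.agg(["min", "max", "median"]).T  # stations become the index
    return stats.sort_values("median", ascending=False)

# Tiny illustrative frame (not the real only10_df).
demo = pd.DataFrame({"F321": [93.28, 91.13, 95.05],
                     "LSIN1": [76.68, 86.65, 11.25]})
print(crt_stat_df_sketch(demo))
```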
In [ ]:
plot_grp_stats(only10_df, ONLY_DF10_REC_COUNT, f"{DELTA_YEARS['start']}-{DELTA_YEARS['end']}; {len(only10_df.columns)} stations", tickangle_=45).show()

In [ ]:
get_fig_density_scatter(only10_df,loc_df, only10_df.median(axis = 0),
                        "WQI_median",
                        f"Stations in King County, WA, with at least {ONLY_DF10_REC_COUNT} records for "
                        f"the period {DELTA_YEARS['start']}-{DELTA_YEARS['end']}.").show()
l_df=                                             SiteName        Locator  \
35                  Crisp Creek below hatchery intake           F321   
43                                   Ravensdale mouth          LSIN9   
50     South Fork Skykomish at hwy 2 mile marker 47.5      SKYKOMISH   
38                Harris Creek at Carnation Duvall Rd       HARRIS_1   
37           Griffin Creek approx 1.5 mi E of hwy 203        GRIFFIN   
52                                   Tolt River mouth       TOLT_MTH   
34          Jenkins Creek at Kent Black Diamond Rd SE           D320   
33                Cherry Creek at NE Cherry Valley Rd       CHERRY_1   
27                        Green River at 212th Way SE           B319   
46              North Fork Snoqualmie at 428th Ave SE        NFK_SNQ   
25       Ebright Creek mouth at E Lake Sammamish Pkwy           A685   
18         Soos Creek mouth below Soos Creek Hatchery           A320   
5                        Cedar River at Bronson Way N            438   
48                    Raging River mouth near Dike Rd     RAGING_MTH   
44             Middle Fork Snoqualmie at 428th Ave SE        MFK_SNQ   
30      Covington Creek at SE Auburn-Black Diamond Rd           C320   
51                 Snoqualmie River at Taylor Landing      SNQDUVALL   
24     Pine Lake Creek mouth at E Lake Sammamish Pkwy           A680   
23                Issaquah Creek upstream of hatchery           A631   
45                Cottage Lake Creek at Tolt Pipeline           N484   
41                 Venema Creek at Pipers Creek Trail         KTHA03   
49   South Fork Snoqualmie at Snoqualmie Valley Trail        SFK_SNQ   
6           May Creek mouth at Lake Washington Blvd N            440   
29            Boise Creek mouth at SE Mud Mountain Rd  BSE_1MUDMTNRD   
7                 Coal Creek upstream of 119th Ave SE            442   
36                   Little Soos Creek at SE 272nd St           G320   
54        Gorsuch Creek mouth near end of SW 167th St          VA65A   
15            Issaquah Creek mouth at NW Sammamish Rd            631   
19                          McAleer at Bothell Way NE           A432   
16                        Green River at Starfire Way           3106   
22              Idylwood Creek mouth at Idylwood Park           A620   
21                  Lewis Creek mouth at 187th Ave SE           A617   
32                           Bear Creek at NE 95th St           C484   
47  Patterson Creek mouth at W Snoqualmie River Rd SE       PATTER_3   
53                 Judd Creek near end of SW 225th St          VA42A   
3                  Lyon Creek mouth at Bothell Way NE            430   
13                 Little Bear mouth near NE 178th St            478   
9                Juanita Creek mouth at NE Juanita Dr            446   
1             Crisp Creek mouth at SE Green Valley Rd            321   
40                                      Piper's Creek         KTHA01   
11                Swamp Creek mouth at Bothell Way NE            470   
14                 Sammamish River at NE Marymoor Way            486   
20                       Forbes Creek at 108th Ave NE           A456   
4           Thornton Creek mouth at Sand Point Way NE            434   
26                    Ames Creek mouth at NE 100th St         AMES_1   
17          Mill Creek near 68th Ave S and S 262nd St           A315   
39                                 Pipers Creek mouth         KSHZ06   
31                    Longfellow Creek at SW Yancy St           C370   
55                 Tibbetts Creek mouth at footbridge           X630   
28                    Evans Creek at NE Union Hill Rd           B484   
12         North Creek mouth at Sammamish River Trail            474   
10                     Sammamish River at NE 145th St         0450CC   
8                       Mercer Slough at 121st Ave SE            444   
2                Newaukum Creek mouth at 212th Way SE            322   
0               Springbrook Creek mouth at SW 16th St            317   
42                                   Rock Creek mouth          LSIN1   

    WQI_median  
35      94.780  
43      91.095  
50      89.700  
38      89.205  
37      88.050  
52      87.355  
34      87.255  
33      87.015  
27      86.080  
46      85.455  
25      85.010  
18      83.775  
5       83.370  
48      79.290  
44      79.085  
30      78.605  
51      78.155  
24      75.250  
23      74.985  
45      73.465  
41      73.455  
49      72.255  
6       71.435  
29      70.945  
7       70.285  
36      68.770  
54      68.050  
15      68.035  
19      67.160  
16      66.955  
22      66.435  
21      66.025  
32      66.020  
47      65.430  
53      64.780  
3       63.740  
13      63.635  
9       61.695  
1       60.365  
40      60.310  
11      59.360  
14      57.655  
20      56.655  
4       54.550  
26      52.815  
17      52.395  
39      51.425  
31      49.930  
55      47.010  
28      44.150  
12      43.690  
10      42.590  
8       41.870  
2       41.200  
0       29.350  
42      20.670  
In [ ]:
def count_records(df_, threshold_, col_='median'):
    """Count the records satisfying the threshold condition.

    @param df_ dataframe we want to check against the threshold
    @param threshold_ values >= threshold_ will be counted
    @param col_ column to apply the threshold to

    @return number of records satisfying the condition
    """
    return df_[df_[col_] >= threshold_].shape[0]


def print_stat_counts(df_):
    count_all = df_.shape[0]
    count_ge80 = count_records(df_, 80)
    count_ge40 = count_records(df_, 40)
    count_less40 = count_all - count_ge40

    print(f"     median >= 80: {count_ge80} recs, i.e., {count_ge80/count_all}")
    print(f"80 > median >= 40: {count_ge40 - count_ge80} recs, i.e., {(count_ge40 - count_ge80)/count_all}")
    print(f"     median <  40: {count_less40} recs, i.e., {count_less40/count_all}")
    print(f"All records      : {count_all} recs.")


print_stat_counts(only10_stats_df)
     median >= 80: 13 recs, i.e., 0.23214285714285715
80 > median >= 40: 41 recs, i.e., 0.7321428571428571
     median <  40: 2 recs, i.e., 0.03571428571428571
All records      : 56 recs.

There are 13 stations ($\approx$ 23%) whose global WQI median falls in the Low concern zone. They are located in the eastern and southern parts of the region, mostly on its outskirts.

There are two stations ($\approx$ 3.6%) with High concern WQIs: locator 317 (Springbrook Creek mouth at SW 16th St) with a median WQI of 29.35 and LSIN1 (Rock Creek mouth) with a median WQI of 20.67.

The highest WQI median, 94.78, was recorded at F321 (Crisp Creek below hatchery intake). The second highest, 91.095, belongs to LSIN9 (Ravensdale mouth).

There are only two ($\approx$ 3.6%) High concern locators w.r.t. the WQI median. The majority, 41 out of 56 records ($\approx$ 73%), fall into the Moderate concern category.
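The cut-offs used throughout (median $\ge$ 80 is Low concern, 40-80 is Moderate concern, below 40 is High concern) can be captured in a small helper; the function name is ours, but the thresholds mirror those in `print_stat_counts` above:

```python
def wqi_category(score: float) -> str:
    """Map a WQI score to its King County concern category.

    Thresholds follow the ones used in this notebook:
    >= 80 Low concern, 40 <= score < 80 Moderate concern, < 40 High concern.
    """
    if score >= 80:
        return "Low concern"
    if score >= 40:
        return "Moderate concern"
    return "High concern"

# The extremes identified above:
print(wqi_category(29.35))  # station 317 -> High concern
print(wqi_category(20.67))  # LSIN1      -> High concern
print(wqi_category(94.78))  # F321, best median -> Low concern
```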

It seems that water quality is better outside the centers of bigger, densely populated areas.

In [ ]:
plot_yearly_WQI(only10_df, get_stratified_median_df(only10_stats_df), title_ = 
                "AnnualScore WQI for all locators in King County, WA,"
                f" with {ONLY_DF10_REC_COUNT} records"
                f" for years {DELTA_YEARS['start']}-{DELTA_YEARS['end']}. The legend is sorted in"
                " descending order by the stations' WQI median."
                f" {len(only10_df.columns)} stations.").run_server(debug=True)

F321 has the best global median of all locators for the period; however, in 2016 its WQI scored 69.5, which placed it in the Moderate concern category. Locators LSIN9, SKYKOMISH, HARRIS_1, and GRIFFIN have stayed in the Low concern category for the whole period.

Locator RAGING_MTH has been systematically improving (2020 - 58.93, 2021 - 63.23, 2022 - 87.36). Similarly, MFK_SNQ improved from 67.49 in 2020 to 73.77 in 2021 and 84.86 in 2022.

Locator SNQDUVALL has an interesting history of Moderate concern/Low concern switchbacks. After plummeting from 83.94 in 2019 to 64.77 in 2020, it has been on a steady rise: 75.77 in 2021 and 87.25 in 2022.

Locator A620 shows a recovery trajectory after a sharp drop: 80.76 in 2020, 49.28 in 2021, and 71.99 in 2022. A617, C484, and VA42A behave similarly, though less dramatically.

Locator 430 has been systematically recovering since 2019, when its WQI of 42.5 came close to the High concern category. Over three years it improved to nearly Low concern, with a WQI of 78.68 in 2022.

Locator B484 advanced to Moderate concern with 51.97 in 2022, up from 27.8 in 2021.

Locator BSE_1MUDMTNRD is troublesome because its WQI decreased from 76.31 in 2021 to 55.98 in 2022.

Locator SFK_SNQ also has a history of a sharp WQI decrease, from 83.36 in 2019 to 57.46 in 2020. However, it has since been on a steady rise toward Low concern, reaching 80.05 in 2022.

Locator A685 has recently deteriorated (75.32 in 2021 and 73.12 in 2022). Locator A320, although it improved from 71.27 in 2021 to 75.39 in 2022, had dropped from Low concern (84.73 in 2020) to Moderate concern (71.27 in 2021).

Locator LSIN1 has the worst median WQI. In 2014 it reached the Low concern zone with 86.65; the next year, however, it plummeted to the High concern category with a WQI of 11.25. It has never recovered and remains in the High concern category; in 2022 its WQI was 18.52.
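The per-station stories above (steady improvement at RAGING_MTH, an unbroken slide elsewhere) can be screened for programmatically. A sketch flagging stations whose WQI rose, or fell, in every one of the last n years; the function and variable names are ours:

```python
import pandas as pd

def monotone_trend(series: pd.Series, n: int = 3) -> str:
    """Classify the last n observations as 'improving', 'deteriorating', or 'mixed'."""
    tail = series.dropna().iloc[-n:]
    diffs = tail.diff().dropna()
    if (diffs > 0).all():
        return "improving"
    if (diffs < 0).all():
        return "deteriorating"
    return "mixed"

# RAGING_MTH's 2020-2022 values from the table above.
raging = pd.Series([58.93, 63.23, 87.36], index=[2020, 2021, 2022])
print(monotone_trend(raging))  # improving

# A620's 2020-2022 values: drop then recovery.
a620 = pd.Series([80.76, 49.28, 71.99], index=[2020, 2021, 2022])
print(monotone_trend(a620))    # mixed
```

Applying this column-by-column to `only10_df` (e.g., via `only10_df.apply(monotone_trend)`) would reproduce the improving/deteriorating lists without reading the chart by eye.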

Summary¶

There were two questions raised:

  1. Q1: How has WQI changed over the years? Was WQI better in the past?

  2. Q2: How does WQI change with locators' geography? Is there any pattern, such as more densely populated areas having worse WQI than sparsely populated areas? Does any area (north/south/east/west) have cleaner water than the others?

The data timespan was 1970-2023 with 78 stations recording WQI.

In order to answer question 1, we can use the diagrams describing the yearly medians:

Question 1: How has WQI changed over the years? Was WQI better in the past?¶

Years 1970-2023¶

The period 1970-2023 covers 23 stations with at least 46 records available each, spanning a decent area of the region. Where data are missing, the gaps concern years $\le$ 1978 and/or 2010-2014 inclusive. Based on the analysis of the yearly median, WQI stays around the middle of the Moderate concern band, i.e., WQI $\approx$ 60, mostly in its lower half. In recent years, i.e., 2017 onward, it has been above 60 (or very close to it; the 2019 median WQI was 59.11). Based on that, we can view the future of our waters with cautious optimism.

We can see that the number of stations stabilized in 1979; during 1970-1978, service at many stations was interrupted. The boxplot values show that 2022 beats 1979 across the board (2022 vs. 1979): max 93.22 vs. 83.92, Q3 76.68 vs. 75.21, median 68.34 vs. 48.34, Q1 54.62 vs. 34.5, and min 31.56 vs. 1. The 1979 distribution is right-skewed, meaning the upper half of the scores (median to Q3) is more dispersed than the lower half (Q1 to median). The 2022 distribution is left-skewed, meaning its upper half is less dispersed.
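The skewness statement can be checked directly from the quoted five-number summaries; the dispersion proxy below (Q3 minus median vs. median minus Q1) is our shorthand for it:

```python
# Five-number summaries quoted above: (min, Q1, median, Q3, max).
summary_1979 = (1.0, 34.5, 48.34, 75.21, 83.92)
summary_2022 = (31.56, 54.62, 68.34, 76.68, 93.22)

def upper_minus_lower_spread(five_num):
    """Positive -> upper half (Q3 - median) more dispersed than lower (median - Q1)."""
    _, q1, med, q3, _ = five_num
    return (q3 - med) - (med - q1)

print(upper_minus_lower_spread(summary_1979))  # positive: right-skewed box
print(upper_minus_lower_spread(summary_2022))  # negative: compressed upper half
```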

Years 2013-2022¶

If we look at the yearly median across all stations with 10 records, comparing 2013 and 2022, we see very little improvement: 72.92 in 2022, up from 71.075 in 2013. However, if we start the comparison in 2014, the yearly median was 58.21 across all stations; the highest median, 75.205, was recorded in 2020. So basically, we are back at the point where we were in 2013. Even the skewness is similar, i.e., left-skewed: the data in Q3 is less dispersed than in Q1. In terms of extremes, however, 2022 is worse than 2013: maximum 95.6 vs. 99.93 and minimum 18.52 vs. 25.88 (2022 vs. 2013, respectively).

Conclusion¶

The conclusion we might draw is that we are better off compared to the past (1979). Compared to 2013, the median WQI has improved by 1.845 points, but the minimum and maximum have worsened.

Q2: Are there any patterns in WQI w.r.t. to a locator's geographical location?¶

Years 1970-2023¶

The map for the period 1970-2023 shows that densely populated areas are mostly in the Moderate concern category. The three Low concern stations are located in the southern part of the region, and even they lie close to Moderate concern or High concern stations. Locator D320, with a WQI median of 84.845, is relatively close to G320 with 65.945, and locator B319, with a median of 85.25, is near 322 with a WQI median of 39.33.

Years 2013-2022¶

The map for 2013-2022, showing locators and medians over 10 records, indicates that the eastern and southern parts of the region have the highest median WQI. However, stations yielding Low concern WQIs and stations yielding Moderate concern or High concern WQIs sit in close geographical vicinity. Examples are F321 (94.78) and B319 (86.08) versus 321 (60.365) and 322 (41.20), or LSIN9 (91.095) versus LSIN1 (20.67).

Also, 438 (Cedar River at Bronson Way N) in Renton (a densely populated area) has an outstanding WQI median of 83.37 over 10 years, whereas nearby 317 (Springbrook Creek mouth at SW 16th St) has a WQI median of 29.35.

Conclusion¶

From the analysis, it seems that WQI is likely to be higher in certain parts of the region, i.e., the eastern and southern parts, while the remaining parts are mostly in the Moderate concern category. However, water quality appears to depend more on the particular waterway, and more research is required into why Low concern locators sit next to High concern and Moderate concern locators.

References¶

[1] King County, “Water Quality Index Background.” King County, Washington State, 2023. Available at: King County WQI